You just need to reload it. If you're on a Mac or Linux, in your terminal, you just have to retype your SSH command. That's it. Everybody clear? That's all you have to do when we go back into the cloud later. All right. Well, let's talk about reference guided alignment. And it sounds like a really weird way just to talk about normal alignment, but this is just to prevent confusion. Some people, when they talk about alignment, are really thinking about de novo assembly, meaning you don't know what the actual reference sequence looks like ahead of time and you try to reconstruct it. This is basically the re-sequencing part of the equation, where you already know roughly what the reference sequence looks like, and you just want to align reads to that reference sequence. This is my baby. You know, you're just trolling me, because when I say things like that, it ends up on this PowerPoint. As does that comment. All right. It was actually quite awesome about two years ago, because we had another instructor from Life Technologies, and we would just trade quips back and forth. It got ugly, so Francis had to be sort of a referee. All right, the way I've broken down this lecture is into four main sections. In the beginning, let's just talk about what the data looks like before you even start thinking about alignment, because that has a lot of impact on how you're going to align the reads. Then we'll talk roughly about the different aligner characteristics. We'll also talk about a couple of best practices that come after the normal alignment and that will really help in whatever subsequent analyses you're going to do, especially if you're doing variant calling. And then the last part: about a year ago, I led an effort at Illumina to figure out, okay, which aligners do we truly want to use?
And during that process, we really figured out how you would go about testing that: what factors actually play an important role when you're trying to decide which aligner to use, besides just asking your favorite bioinformatician, "Hey, what's your favorite aligner?" and getting whatever he happens to swear by. Now you can actually make an informed decision. So when I started off in this field, and it was actually the exact same time that Aaron started off in this field, there was basically Sanger capillary sequencing, and the 454 machine was out in the market. Things were pretty easy back then. And then by the time I graduated, bam, this happened. So now you have machines from Illumina, you have the AB SOLiD machines, and you can find out about Helicos, Complete Genomics, PacBio, Ion Torrent. The question mark sort of stands in for everything that hasn't truly come out yet. You could possibly put Oxford Nanopore there, or many other sequencing technologies that are in development. But basically you see a trend here: we have an explosion of different technologies, and that's going to have an impact on how you want to align your reads. And if you wanted to look at these different sequencing technologies in another way, one way is this: on the x-axis, I've arranged them by the amount of amplification that's required. So on the far left, you have single molecule sequencing. On the far right, you have lots of amplification, to the point where you probably have to use plasmids to do your amplification, et cetera. And then the y-axis is basically the read length. At the very bottom, we have reads as short as 20 base pairs. At the top, we're talking thousands of base pairs. All of these factors play a big role in how you align your reads and what kind of post-processing will be required. So the trick now is, okay, you have these sequencing technologies.
Somehow you're trusting them to analyze their own data. So in the case of Illumina, you have a lot of images where the light intensities tell you which bases have been incorporated. And at the end of the day, what you want to see is actual bases, and preferably base qualities. When we're talking about base qualities, that's basically answering this question: sequencing machines aren't perfect, so they're all making errors. The question is, how good are they at estimating what their error rates are, and how do you represent those in the files that you'll be using for alignment? And so this is the FASTQ format. This is the most popular format for reads that you'll see out in the field. For the most part, it involves four lines. The first line always starts with an @ sign, and then you have a read name. In some cases, that can get quite complex. The second line is all the bases that you've sequenced for that read. Then you have a plus sign that separates it; you can have some other data after the plus sign if you want. And this fourth line is what confuses every single person that's ever seen a FASTQ file for the first time. They're like, what the heck is this? Well, I'll tell you. These are character representations. If you were to go to Wikipedia or another webpage and look up the ASCII codes, you'd see that this @ sign has an ASCII code of 64. And the way they've done these FASTQ files is that there's an offset: you take that ASCII code and subtract 33 from it, and you get 31. 31 is the base quality of that first base. And you just keep on going down the line and figure out the rest. You're like, great, now I know what a base quality is. What does it mean? Well, it's based on the Phred scale. It's a logarithmic value where every time you increase the base quality by 10, the probability that that base is an error goes down by a factor of 10.
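To make that offset concrete, here's a minimal Python sketch of the decoding. The function names are my own invention for illustration, not part of any standard tool:

```python
# Decode Phred+33 quality characters from a FASTQ quality line.
# '@' has ASCII code 64; 64 - 33 = 31, so '@' encodes base quality 31.

def decode_qualities(quality_line):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - 33 for ch in quality_line]

def error_probability(q):
    """Phred scale: P(error) = 10^(-Q/10)."""
    return 10 ** (-q / 10)

quals = decode_qualities("@IIA")
# '@' -> 31, 'I' -> 40, 'A' -> 32
```

So a quality line of `@IIA` decodes to qualities 31, 40, 40, 32, and `error_probability(20)` gives the 1% error rate discussed next.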
So a base quality of 10 means that you have a 10% error rate; a base quality of 20, 1%; 30, 0.1%; et cetera. You want to see as high a base quality as possible. I personally like 30 and above. You might have other requirements. There's a pretty interesting article that came out a couple of years ago where they tried to compare the different sequencing platforms. You see there are quite a lot of sequencing platforms here; even amongst the ones that I showed before, there are different generations of each platform. What's important to note is that each of these sequencing platforms has a different error mode, meaning when you see a sequencing error, where is it likely to occur? So in the case of 454, you had what are called homopolymer errors, so you get indel errors, whereas Illumina has substitution errors. And in the same paper, they've tried to estimate what the approximate error rate is. This might have changed now, since two years have gone by; these things tend to improve quite drastically. The last thing I'll mention about base qualities is that back in the Sanger capillary days, the base caller wasn't always the most accurate. If it told you a base was quality 20, and you really measured it, it might not really be a 1% error rate. And so people were writing other base callers, and a famous one was called Phred. What's shown here, from Mark DePristo's group at the Broad Institute, is a QQ plot. All that means is that the base caller's reported base qualities are listed on the x-axis, so here you have zero to 40, and the y-axis represents what you've actually measured. You take all the bases that were recorded as base quality 20, and you figure out how often they were wrong, and from that you can calculate what the observed quality is.
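To make the y-axis of that QQ plot concrete, here's a small Python sketch, my own illustration rather than GATK's actual implementation, of how an observed quality is computed for one reported-quality bin:

```python
import math

def observed_quality(n_errors, n_bases):
    """Empirical Phred quality for one reported-quality bin:
    Q_obs = -10 * log10(observed error rate)."""
    rate = n_errors / n_bases
    return -10 * math.log10(rate)

# If 1 out of every 316 bases that were reported as Q30 is actually
# wrong, the observed quality is about 25, not 30:
q = observed_quality(1, 316)
```

Plotting the reported quality against this observed value for every bin gives exactly the diagonal-or-biased picture on the slide.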
If everything's working perfectly, everything should be on the diagonal. As you see here, there's some bias. Up here, where the base caller said it was a base quality 30, the measured value looks more like a 25. So one thing you can do is run a base quality recalibrator on this, and with the GATK tool set you can do exactly that. Afterwards, you see it sits really nicely on that diagonal. There are a few interesting looking dots there, and that might simply be because at the very upper ranges you don't have a lot of data points to calibrate with, so you get a little bit of noise. But all in all, this looks good. The reason you want calibrated base qualities is that they're going to be really useful for the aligners later on to calculate an alignment quality, and really important for the variant caller. So let's start the next section, whoops, where we're going to talk about some basic aligner characteristics. Now, it's almost a shame I haven't updated this picture, because this is my nephew Josh, and Josh is like this tall now. If I keep using this one, he'll punish me if he ever finds out I've been using it at every single workshop. But re-sequencing, or reference guided alignment, is a lot like the most hellish puzzle that you've ever encountered. Instead of a thousand pieces that you're trying to juggle, you have 40 million pieces. You luckily have a picture on the box, and that represents the reference genome you're trying to align to. But you see there's a disparity. If this represents the reads that you're trying to align, well, holy cow, there's a whole bunch of reads here that are not on the box. So what's that all about? Well, that could represent the reads that are not actually in your reference sequence. But more to the point with alignment: the complexity of aligning things.
There are some regions in your genome that are very similar, very repeat heavy. It's almost like trying to find the right piece for the sidewalk there; it's pretty much all the same color, so you're relying on the puzzle shapes at this point to find the right place. Whereas if you go into Josh's eye, that's pretty easy. That's a high complexity read, in this case, so it's easy to place accurately. And then in some cases, you're going to have representation bias. There are going to be some parts of the genome that are not easy to amplify, and so you're going to have these little gaps in your coverage. So if you pardon the whole analogy, it actually worked pretty well. Now, for each of these sequencing technologies, there are different challenges. If we take Illumina, Illumina uses sequencing by synthesis. What tends to happen there is, if this top sequence represents your reference and the bottom one represents a read, the sequencing errors accumulate towards the ends of the reads. Why is this relevant for the aligners? Take, for example, BWA. BWA has a built-in read trimmer. It starts looking at the base qualities at the 3' end of the read and starts trimming off the really bad bases from the end, and then you'll get a much better alignment out of it. With single molecule sequencing, the reasons are different for each of the technologies, but the end result tends to be the same. In the case of Helicos, again, if this is the reference and this is your read, there are some cases where a base that's been incorporated is not detected. In the case of Helicos, it was because the fluorophores did not fluoresce. In the case of PacBio, it's because incorporation is a stochastic process and it's tough to detect every event in time. But what you end up with is these gaps in your reads.
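As an illustration of that kind of 3'-end trimming, here's a sketch in the spirit of BWA's quality-trimming option. This is my simplified reading of the heuristic, not BWA's actual code: cut the read at the 3'-end position that maximizes the summed quality deficit of the trimmed tail.

```python
def quality_trim_length(quals, threshold=20):
    """Return the trimmed read length: scan from the 3' end and cut at
    the position that maximizes the running sum of (threshold - quality)
    over the trimmed tail. Keeps the whole read if no tail scores
    positive (i.e. the tail is already good quality)."""
    best, running = 0, 0
    cut = len(quals)              # default: keep everything
    for i in range(len(quals) - 1, -1, -1):
        running += threshold - quals[i]
        if running > best:
            best, cut = running, i
    return cut

# A read ending in two Q5 bases gets trimmed back to length 3:
kept = quality_trim_length([30, 30, 30, 5, 5])
```

The nice property of the running-sum formulation is that a single bad base buried inside a good tail doesn't trigger trimming; only a tail that is bad on balance gets cut.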
And a lot of aligners are tuned for indels that are due to biological processes. In this case, this is totally a technical process. So your aligner has to be smart enough to deal with gaps in such a way that it works both for the sequencing error mode and for biological indels. And then if we take 454 and ion semiconductor sequencing, we have homopolymer issues. What we see here is two gaps. The first gap is a deletion here, next to the AC. And here we have a deletion in what was a homopolymer string of A's. In this case, it probably wasn't a deletion at all. As your homopolymers get longer and longer, it becomes more and more difficult for these sequencing technologies to guess how many of the same bases are in a row, and you'll see that in the read data. So gap two would be a suspicious gap. Gap one, because it's not in a homopolymer, might be a true biological event. Yes? [A student asks whether these homopolymer errors tend to land on one particular side of the homopolymer run, based on counting them in aligned reads.] Often that can be an artifact of how you're looking at the aligned reads afterwards. But whether the error falls to the left or the right of the homopolymer, that is the big question; you've basically driven right to the core argument there. In the case you mentioned, I haven't experienced that myself, but there's certainly sequence bias: 454 has a fixed incorporation order, and depending on the sequence context before that incorporation, you might get a different size call for the homopolymer. I'll actually burden my colleague Aaron here. Aaron is, or used to be, our resident 454 guru.
So yeah, feel free to talk to either me or Aaron afterwards, and we can talk about it some more. We talked about base qualities before, and those basically measure the error rate for each base. It actually took us aligner developers quite a long time to realize, hey, that's a pretty good idea, let's apply the same thing to alignments. And the alignment quality says the same thing: it gives you the probability that this read has been misaligned. Now, as nice as that base quality calibration QQ plot looked, things don't look nearly as nice if you look at the alignment qualities for the different aligners. In this case, I'm showing the QQ plot for BWA. And as strange as this might look, BWA actually seems to be the best calibrated of the aligners that I've looked at when it comes to alignment qualities. These points up here are basically just sampling issues. Even though you expected a one in ten thousand error rate over here, because it said 40, you just didn't have enough reads, or you didn't have any reads that misaligned at that point, so we capped it off at 99. If you kept on sequencing and sequencing until you eventually got a misalignment, I think it would all line up on the diagonal pretty nicely. I won't go into detail on this one, but depending on the aligner, there are different heuristics used to align the reads. In the case of MAQ and BWA, the alignment qualities were determined largely by one equation that calculated the probability that the best hit you found was wrong. But the second formula there tries to account for the fact that we're using a heuristic algorithm: maybe you didn't seed your alignment correctly, maybe there were too many errors in that 32 base pair seed that you might have used.
And so here we're trying to calculate the chances of that happening. An important thing to note is that there's a common, I wouldn't say standard, but common practice: if you find multiple potential alignments that are equally good, say you can align a read to five different places and none of them have any mismatches, then you tend to assign an alignment quality of zero to that read. That way the variant callers know to probably ignore that read and rely on the reads that are uniquely placed. So there are two main strategies for aligning. One is called hashing, and the other uses the Burrows-Wheeler transform. All hashing really does is this: if the read looks like this, you make small subsequences of it. In this case, I have a bunch of subsequences that are offset by one. Then you just ask the question, where does this subsequence occur in the genome? It occurs here on chromosome 1. You ask about the next one, and you see a trend: most of these are occurring on chromosome 1 at locations offset by one, but you're also picking up other locations, because these are short subsequences. This happens especially if you're aligning to the human genome, where you have three billion base pairs to align to. If you have a hash sequence this short, you're going to get many, many different hits. But at the end of the day, you look for this trend and then you say, ah, this is the potential alignment location. Like, if you don't have a reference, then what? Yes, if you don't have a reference, this entire lecture is moot. That's why I differentiated between de novo assembly and reference guided alignment.
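The hashing idea above can be sketched in a few lines of Python. Everything here, the k-mer size and the toy reference, is invented for illustration:

```python
from collections import defaultdict

def build_index(reference, k=4):
    """Hash every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def candidate_position(read, index, k=4):
    """Look up each k-mer of the read, shift its reference hits back by
    the k-mer's offset in the read, and return the position that the
    most seeds agree on. Returns None if nothing matched."""
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for pos in index[read[offset:offset + k]]:
            votes[pos - offset] += 1
    return max(votes, key=votes.get) if votes else None

ref = "ACGTACGTTTGCAGGA"
idx = build_index(ref)
pos = candidate_position("GTTTGCAG", idx)   # the seeds agree on position 6
```

Real aligners use larger k-mers and far more elaborate seed-and-extend logic, but the voting idea, many short exact hits converging on one location, is the same.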
But what you can do, and what a lot of people do, is this: if you're dealing with a novel organism, you can pick a related reference sequence to align to. I was talking with one of the students before; I used to do population genomics as well, and one of the things we did, when we were just trying to find out where the UTRs are, was to align to Arabidopsis and see, okay, where are these things occurring? So you can use that for annotations, and you can do the same sort of technique here. Or, the best technique is to combine both: you do a de novo assembly experiment, you create your contigs, and then from those contigs you do the re-sequencing analysis. Yes, and that's actually an extremely good question. What I find is that a lot of the aligners out there struggle if you have a read that is more than 5% divergent from what you're trying to align it to. For those of you that have done a lot of comparative genomics, you see this problem a lot, and there are specific aligners in that field that actually handle highly divergent reads. But aligners like BWA and Bowtie are not really designed to handle that kind of edit distance. So, speaking of BWA, BWA uses a suffix array, or Burrows-Wheeler transform, approach. This is a highly simplified look at what's happening, but basically, if you were to take this read and try to find out where it is in the reference, the idea is that you keep on adding bases from the end of the read, the suffix side. If you were just to look for all the places where a T occurs in a genome, you're going to get lots and lots of hits. But as you add more and more bases, as you're traversing the suffix array, you're going to hone in on a few candidate locations.
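Here's a naive Python sketch of that narrowing idea, using a plain suffix array and brute-force matching rather than the real FM-index machinery; the sequences are invented for illustration:

```python
def suffix_array(text):
    """Naive suffix array: start positions of all suffixes, in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def narrow(read, text, sa):
    """Extend the query one base at a time from the 3' end of the read and
    record how the set of matching reference positions shrinks."""
    steps = []
    for n in range(1, len(read) + 1):
        query = read[-n:]                          # last n bases of the read
        hits = [p for p in sa if text[p:p + n] == query]
        steps.append((query, hits))
    return steps

ref = "TACGTTAGT"
sa = suffix_array(ref)
steps = narrow("TAGT", ref, sa)
# "T" alone matches at four places; by the time the full "TAGT" is
# queried, only one reference position remains.
```

The real algorithm gets each narrowing step in constant time by maintaining an interval over the Burrows-Wheeler transformed text instead of rescanning, which is why this scales to three billion base pairs.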
And once you've finally found a unique hit, or the best hits, you convert these Burrows-Wheeler coordinates back into genomic locations. In this case, hopefully you would get the same location that we were showing in the hashing approach. The Burrows-Wheeler transform actually involves a bunch of other parts of the algorithm, and I'd be more than happy to discuss that with you during the break, but it's a little bit beyond this lecture at the moment. Also, it took the aligner developers quite a long time, and when I say aligner developers, I used to be one of them, so it's okay for me to mock them, to come up with a standardized file format. My own aligner in the beginning had its own format, and MAQ had its own format, et cetera. It was crazy in the beginning, but things have settled down now. So now we have SAM and BAM. The SAM file is the text version, and that's what I'm showing here. BAM is just a binary representation of the same thing. In practice, nobody in their right mind uses the SAM format when they're doing bigger projects; it takes up more space and it's slower to parse. You want to do everything in BAM. But from an educational standpoint, it's useful to talk about the text representation instead of showing hex codes. So the first field here is the read name. The second field is a flag. That's a number that basically contains a bunch of little bits. What kind of flags would be in there? Things like: is this read on the reverse strand? Is this read one or read two? Is this a PCR duplicate? Was this read originally paired, or is it a single end read, et cetera? I can also show you afterwards how you would create a number out of those flags. The next two fields are the reference position: this reference sequence, and then this position. 60 is the alignment quality; that's what we just talked about.
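Those FLAG bits can be unpacked with a few lines of Python. The bit values shown follow the SAM specification, but this is only a partial table and the helper itself is just my illustration:

```python
# A few of the standard SAM FLAG bits (see the SAM spec for the full list).
FLAGS = {
    0x1:   "read paired",
    0x4:   "read unmapped",
    0x8:   "mate unmapped",
    0x10:  "read on reverse strand",
    0x40:  "first in pair (read 1)",
    0x80:  "second in pair (read 2)",
    0x400: "PCR or optical duplicate",
}

def decode_flag(flag):
    """List the properties set in a SAM FLAG integer."""
    return [name for bit, name in FLAGS.items() if flag & bit]

# 99 sets bits 0x1, 0x2, 0x20, and 0x40; only the bits present in our
# partial table above get reported.
props = decode_flag(99)
```

Building a flag goes the other way: you OR the bits together, so a paired, first-in-pair read is `0x1 | 0x40`.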
And one of the things that Michelle alluded to in her introduction: there are things called CIGAR strings. A CIGAR string is basically trying to tell you, given these bases, how does this read truly align to the genome? So this one is saying that the first 76 bases have operation M, and, annoyingly in my opinion, M means match or mismatch, so you can't judge the alignment just by looking at the CIGAR string. What they've actually done in the latest BAM format is come up with two new CIGAR operations, X and the equal sign, that distinguish match from mismatch. But none of the downstream utilities that I've used actually support them, so we're stuck with this for now. But you're asking now, how is this useful? Well, it's useful if you have an insertion or deletion. For example, 2I would mean you have a two base pair insertion at some location; 3D would mean you have a three base pair deletion at some location. What does CIGAR stand for? To be honest, I have no clue. I've always just heard people refer to it as the CIGAR string. Anybody? It's been forgotten in the annals of history. So in this case, we're presuming that this was a paired end read, and so it's always useful to know, okay, where did the other read align to? The equal sign is just a shorthand saying it's the same reference sequence, so it's also on chromosome 1, but it aligns to this position. And then, there you go, this last field is the template length. With a paired end read, what you've done is sequence the two ends of a DNA fragment, so the template length answers: what is that fragment length? It's basically the end of this read minus the start of that read. And then of course you have all the bases, and then all the base qualities, encoded in that same style that FASTQ used.
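Pulling the indel information out of a CIGAR string is just a matter of splitting it into length/operation pairs. Here's a short sketch; the operation classification follows the SAM spec, while the function names are mine:

```python
import re

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def reference_span(cigar):
    """Number of reference bases the alignment covers. M, D, N, =, and X
    consume the reference; I and S consume only the read."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")

# A 78 bp read with a 2 bp insertion and a 3 bp deletion:
span = reference_span("30M2I20M3D26M")   # 30 + 20 + 3 + 26 = 79
```

Note the asymmetry: the insertion makes the read longer than its reference footprint, while the deletion does the opposite, which is exactly why plain `76M` can't describe indel-containing reads.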
So basically, you take the base quality, you add 33 to it, and you get the ASCII code. [A student asks how the template length works when only one read of a pair aligns.] If I'm interpreting your question correctly: there are situations where you might have done paired end sequencing and the other mate doesn't align at all. That will actually be one of the flags that you get here. The flags say: did read one align? Did read two align? So you can check there first, and in that case, the template length doesn't have a true value, since it's effectively a single-ended alignment. The template length works like this: if this represents your entire fragment, and you have read one here and read two here, the positions you're seeing, when it's aligned this way, are this start position and this start position. You take the end of the rightmost read minus the start position of the leftmost read to get this entire thing; that's the template length. So, there are a couple of post-processing steps that impact whatever analyses you do downstream. In some cases, you can probably skip them if you want. One important one, especially if you want to do variant calling, and especially if you want to make structural variation calls, is taking care of your duplicates. This is less important if you're doing RNA-seq experiments with single-end reads, because it's really difficult to figure out what your true duplicates are in that case. But here's an example. This screenshot is almost like something stolen from the Smithsonian as well; this is the old Consed program. On the top line, we're showing the reference sequence, and here we have a bunch of reads that have aligned. It looks like there are 10 reads aligned at this location. But when you look closer, we have one read that's kind of unique there, while all these reads have the same start and stop location, and these other reads have very similar start and stop locations.
And so what you might have thought were 10 different reads actually turns out to be maybe three, or, if you count this one, four. That kind of bias can heavily influence the variant calls that you make later on. To show what kind of problem this can be, this is a slide that I created way back during the pilot project of the 1000 Genomes Project. Each point on the x-axis here represents a separate 454 run from 1000 Genomes. The black lines show what percent of the reads aligned, and we were getting, almost across the board, 80% of the 454 reads aligning. The red shows how many reads were left after you removed duplicates, and you had almost a whopping 50% duplicate rate here. After a while they found some issue with their library prep and refined it, so after about 120 runs, only 10 or 15% of the reads were duplicates. Yes? Excellent question. Most of them are actually from PCR amplification. Well, it's symptomatic of a low complexity library: if you don't have enough diverse molecules to sequence, then you end up sequencing the same ones over and over again, and you end up with a high duplicate rate. So it's symptomatic of PCR amplification and low library complexity. There are also other forms of duplicates. For example, with Illumina technology, there was the notion that you could have clusters that are really close to each other, to the point where what was really supposed to be one cluster gets counted by the base calling software as two distinct clusters. That's what's called an optical duplicate. But when we've tried to measure that in-house, we've seen that the optical duplicate rate is really small, to the point where we no longer try to quantify it.
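The core of duplicate marking can be sketched in Python like this. It's a deliberately simplified stand-in for what Picard or samtools actually do, and the read fields here are my own toy representation:

```python
from collections import defaultdict

def mark_duplicates(reads):
    """Group reads by (chromosome, start, end, strand), keep the read with
    the highest summed base quality in each group, and mark the rest as
    duplicates. Returns the set of duplicate read names."""
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["end"], read["strand"])
        groups[key].append(read)
    duplicates = set()
    for group in groups.values():
        group.sort(key=lambda r: r["qual_sum"], reverse=True)
        duplicates.update(r["name"] for r in group[1:])  # all but the best
    return duplicates
```

So two reads with identical coordinates and strand collapse to one, which is exactly how the 10 apparent reads in the screenshot become three or four distinct fragments. With paired-end data the key would use both mates' coordinates, which is why duplicates are so much harder to call reliably from single-end RNA-seq reads.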
Yeah, we can discuss that some more a little bit later. There are a couple of well-known programs that are really good at removing duplicates: you have the Picard tools, and samtools also has its own duplicate removal mode. In our lab today, we'll be using samtools. Another really good post-processing step for alignment is indel cleaning, also termed indel realignment. Now, the truth is that a lot of these aligners are not very good at aligning reads that have insertions or deletions in them. One of the characteristics of an aligner is that it's always looking at just one read at a time and aligning it to the reference. But it can be useful to look at all the reads that have aligned to a particular location together. To highlight this, here's a screenshot from IGV. If you're not used to it: in IGV, everything that's gray is the same as the reference, and anytime you see an actual base letter, that means it's different from the reference. So you're looking at this and you're like, oh my gosh, this is a very variable region. And you wouldn't be the first to make the mistake of trusting these when your variant caller picks them up as SNPs. If you try to do a wet lab validation on them, you'll be very upset to find out that there were no SNPs actually here. The same region after indel cleaning looks like this: drastically cleaned up. Whereas before you didn't pick up any indels here, now you have a solid one base pair deletion, and only a few divergent bases from the reference. And so that's one of the, yes? That is a good question. Right now they don't really make that distinction when they're doing the indel cleaning. The hope is that you see a pattern; on the next slide, we see a little bit more of a pattern.
And by doing a multiple sequence alignment at that location, the hope is that you can elucidate what it's supposed to look like, whether the indel is technical or biological. This might be one of the few times where the origin of the indel doesn't matter as much as whether you can properly represent it. So this is an interesting example, with before at the top and after at the bottom. Here we're showing a gapped assembly, and at the top is the reference sequence. What we see before cleaning is two insertions: four reads that have inserted some bases here, and one read that's inserted some bases over there. They call this the scatter mode, where you have a bunch of indels scattered along the region. But after you've done the indel cleaning, you figure out, oh, this is just one insertion event, and everything gets a lot more cleaned up. One of the big changes is that before the indel cleaning, you had a bunch of mismatched bases at the end of that read. If you had a low coverage sample and you were doing variant calling, you might mistakenly pick those up as single nucleotide polymorphisms or a multiple nucleotide polymorphism. Any questions so far? All right. GATK has an indel realigner, and that's the one we'll be using in today's lab. Now let's finish up the lecture by talking about what makes a good aligner. How would you actually evaluate different aligners? The first thing that comes to anybody's mind is, what is the fastest aligner out there? I want speed. So we compared BWA, SOAP, and Bowtie. What you see here is that, out of those three, BWA was by far the slowest. SOAP was almost twice as fast, and Bowtie was almost three times as fast as BWA. Okay, speed is nice and everything, but there are other issues to take into account.
Okay, so suppose you had a simulated data set, one where you've simulated sequencing errors and everything else. Actually, in this case, it wasn't simulated at all: we took a real data set and used an extremely accurate aligner that is guaranteed to find every possible alignment. And you know what that also means: it's extremely slow. But with that truth set, here's what you can measure. On the x-axis, we have the edit distance, meaning how many mismatches were in those reads. On the y-axis, the solid line shows what percentage of reads were aligned at all, and the dashed line shows how many were correctly aligned. So the gap between the solid and the dashed line is the percentage at that edit distance that was misaligned. The one thing you notice is that Bowtie has a very low alignment fraction. There are a bunch of reads, even at an edit distance of zero, where in some cases it wasn't sensitive enough to find the alignments. It was interesting in that if it did make an alignment, it seemed pretty certain that it was correct. BWA aligned the highest fraction, but the trade-off is that some percentage of those reads are misaligned. And then we've already discussed the alignment quality calibration: we looked at how well the alignment qualities were calibrated amongst all of these, and I won't go into detail on that. This is the kind of evaluation everybody would do if they were looking at aligners. But at the end of the day, what you're really concerned about, the reason why you set about doing alignment and variant calling and a bunch of other processes, is that you probably had a fundamental biological question you were trying to answer. And so it really doesn't matter how good your alignments look according to a bunch of made up metrics like the ones I showed.
What really matters is how it impacts your results at the end of the pipeline. So one of the things we did in this experiment is we used the same variant caller with the exact same settings for all these different aligners, and we also did a trio conflict analysis. In this case, a trio just means you sequence two parents and a child, and then you can use Mendelian genetics, create your own Punnett square, to figure out whether a call is plausible. Yeah, I've added this slide; I'm seeing all the confused looks. There are two slides today that I've added in, and I added this one because I realized that if I just haphazardly mentioned trio conflicts, maybe not everybody would know what that is. So basically, if your two parents have these genotypes and your child has this genotype, that's a valid set of calls. But if one of your parents is AT and the other parent is AA, there's no way your child can be homozygous TT. That would be called a trio conflict. Exactly, exactly. The milkman. Anyway, when we looked at trio conflicts among the variant calls produced using these three aligners, what we noticed is that we had a very high trio conflict rate when we used Bowtie. When you think about it, it makes a lot of sense: Bowtie is what we call an ungapped aligner, meaning when it aligns reads it will not add gaps to accommodate indels. There are some post-processing steps that can work around this, and Bowtie 2, I think, gets around it, but the original Bowtie didn't. BWA has the lowest trio conflict rate; it looks very good. And SOAP, well, I actually cheated with SOAP. SOAP doesn't calculate alignment qualities, and so I decided, ah, that's too bad, I'll just make some for it. When I did that, it actually turned out to be pretty good. But you shouldn't pay too much attention to this; that was taking cheating to the next level. So we've talked about accuracy.
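The Punnett-square check described above is easy to mechanize. This is a toy sketch of my own, not any published tool's implementation, using unordered two-letter genotype strings:

```python
from itertools import product

def trio_conflict(mother, father, child):
    """Return True if the child's genotype violates Mendelian
    inheritance, i.e. no pairing of one allele from each parent
    produces it. Genotypes are unordered allele pairs like "AT"."""
    possible = {"".join(sorted((m, f)))
                for m, f in product(mother, father)}
    return "".join(sorted(child)) not in possible

# The lecture's example: AT x AA cannot produce a TT child.
print(trio_conflict("AT", "AA", "TT"))  # -> True (conflict)
print(trio_conflict("AT", "AT", "TT"))  # -> False (valid)
```

Counting how often calls at a site fail this check across a sequenced trio gives the trio conflict rate used to rank the aligners, since every conflict is either a genotyping error or a (rare) de novo mutation.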
Now that we've seen what the trio conflict rates look like, let's go back and look at some other metrics and see if we can find a pattern. So here again we're showing BWA, SOAP, and Bowtie, and here we have the trio conflict rates. Now I've listed some other stuff: what was the overall coverage? What was the genome accessibility at 1x or 10x, meaning what percentage of the genome is covered by at least one read, and what percentage is covered by at least 10 reads? And then what was the aligner accuracy again? What you notice is that aligner accuracy is not a very good predictor of the trio conflict rate. Coverage, on the other hand, seems to be a very good predictor, and genome accessibility is kind of a secondary characteristic. So it basically comes down to two things. First, if you have an aligner that's able to align more reads than another, then just by virtue of having that deeper coverage, your variant caller is going to give you better results. And second, it doesn't matter so much whether your aligner has the best accuracy; what matters is whether the aligner is smart enough to figure out that it's crap at aligning certain types of reads, and to give those reads a very low alignment quality. I think that's what saved some of these aligners and produced those low trio conflict rates. So, just to recap the aligner comparison: BWA was the slowest, but it produced the lowest trio conflict rate. It was also tied with Bowtie for the best resource usage; we measured things like how much memory is required and how hard it pounds away on the disks, and BWA did really well on that. Community acceptance is also probably better for BWA than for the other two. Bowtie was the fastest of the three we looked at, but it didn't do very well on trio conflicts when we looked at the variant calls.
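The 1x and 10x accessibility numbers are just threshold fractions over a per-base depth track. A minimal sketch (toy depths; real pipelines would read these from a coverage file):

```python
def accessibility(depths, min_depth):
    """Fraction of genome positions covered by at least `min_depth`
    reads, given a per-base read-depth list."""
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)

# Toy per-base depths for a 10 bp "genome"
depths = [0, 3, 12, 15, 9, 1, 0, 11, 10, 2]
print(accessibility(depths, 1))   # 1x accessibility  -> 0.8
print(accessibility(depths, 10))  # 10x accessibility -> 0.4
```

An aligner that places more reads pushes both numbers up, which is why accessibility tracks the trio conflict rate better than raw aligner accuracy does.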
That might be because it's an ungapped aligner. And then finally, SOAP was kind of middle of the pack, and perhaps that's only because I cheated and created the alignment qualities for it. So that might give you some perspective on how you might evaluate other aligners. Now, having said that, your needs might be totally different. So here are different aligners that you might have seen or heard of, and they address different areas. For speed, there are a couple of new aligners, SNAP and iSAAC: SNAP is from Berkeley, and iSAAC is actually from Illumina. For accuracy, there's a really good aligner called Novoalign. It's a commercial product, though, so if you want to use it on more than one processor, you're going to have to pay a licensing fee. RazerS is also highly accurate. The trade-off is that when you go for these super-accurate aligners, you might give up some speed. For all-around aligners, there's BWA, and in particular a new mode built into BWA called BWA-MEM. It's actually a bit faster and seems to be somewhat more accurate, and we'll be using it in today's lab. And then there's Bowtie; at the same site, you can also find Bowtie 2. And then functionality. All the aligners here address the typical resequencing case, but you might have specific needs. Maybe you're looking for a de novo splice aligner; in that case, this would be more applicable to the last workshop that you had: the STAR aligner and TopHat are ideal for trying to find de novo splice junctions in RNA-seq data. And so for methyl-seq, yes. Yes, yes. So it's kind of a common theme in all of bioinformatics. There's a computer science term for it: GIGO, garbage in, garbage out. If, at the earliest point, you can detect any sort of flaw in your methodology and correct for it, it will help the end result. And so not everybody has the time to do an aligner evaluation up front.
But if you do, that is a really good way of moving forward on a bigger project. We certainly did that in the 1000 Genomes Project: we had all sorts of aligner comparisons, variant caller comparisons, and structural variation tool comparisons, because we wanted to find out what would give us the best results for that initial pilot paper. So, no, that's definitely the strategy we should take. That's the truth. So, not so much merging. I've seen programs out there that help you benchmark how well different aligners do in different scenarios, and I can give you some links to those later on. Any other questions? Yes? So what are they? No, no, it's not a naive question. I haven't seen too many problems at the alignment level; your problem tends to be more about base-composition bias. If you were to take something like Plasmodium falciparum, that's an extremely AT-rich organism, and that's very problematic for aligners to handle, and also problematic for some sequencing technologies. But Michelle's telling me that I should get moving, so I'll oblige. Luckily, my next slide was just the summary. And the summary is basically this: alignment is heavily dependent on your sequencing technology, on that technology's error model and error rate. Also, as important as alignment is, the post-processing steps are equally important, so I just want to stress that one more time: you want to mark or remove the duplicates, and preferably, if you can do indel realignment, you should. And then, again, let me stress that we now have a standardized file format called the BAM file. So if you hear people talking about BAM files, now you should know exactly what they're talking about.
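To make the duplicate-marking step in that summary concrete, here is a toy sketch of the idea: reads whose alignments share the same start coordinate and strand are treated as PCR duplicates, and the best-quality copy is kept. The dict layout and scoring rule are my own simplifications, not how Picard or samtools actually store reads:

```python
from collections import defaultdict

def mark_duplicates(reads):
    """Group reads by (chrom, pos, strand); within each group, keep
    the read with the highest base-quality sum and flag the rest as
    duplicates. Adds a boolean 'duplicate' field to each read dict."""
    groups = defaultdict(list)
    for r in reads:
        groups[(r["chrom"], r["pos"], r["strand"])].append(r)
    for group in groups.values():
        group.sort(key=lambda r: sum(r["quals"]), reverse=True)
        group[0]["duplicate"] = False  # best copy survives
        for dup in group[1:]:
            dup["duplicate"] = True
    return reads

reads = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "quals": [30, 30, 30]},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "quals": [40, 40, 40]},
    {"name": "r3", "chrom": "chr1", "pos": 250, "strand": "-", "quals": [20, 20, 20]},
]
for r in mark_duplicates(reads):
    print(r["name"], r["duplicate"])
```

Here r1 and r2 land on the same coordinate and strand, so the lower-quality r1 gets flagged while r2 and the uniquely placed r3 survive, which is the behavior that keeps PCR duplicates from inflating your variant-calling evidence.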