So, I apologize for not seeing you guys as much today up front here, and for making Malachi almost lose his voice, but what are brothers for, right? I have been trying to get the team assignments fixed. So there's an optional team assignment this evening, and it was not quite in the shape that we wanted it in, so I've been reworking it extensively, so that hopefully you will have a smooth experience going through the team assignment and the TAs will have a smooth experience helping you with it. So I have not just been slacking off; I've been working on something else course related, I promise. Okay, so I guess we probably have time for maybe this lecture, and that will probably be the end, and we'll move on to continuing the practical alignment exercises tomorrow. We'll see how long this takes, but I'm just going to do a quick intro to some of the concepts surrounding alignment. So, generally, alignment is really a common feature of bioinformatics analysis. And this is a very common workflow, where in phase zero you generate your sequence data and get raw FASTQs, and then the most common thing you do is some kind of alignment: alignment to a genome, alignment to a transcriptome. Commonly you'll do quality control, and then in the green boxes are all the different paths you might go down, like peak finding, or quantifying transcripts like we're going to do, or finding different kinds of variation, and so on. And that of course feeds into your analysis and interpretation and hopefully cool new discoveries. So alignment: you can think of it as fitting individual pieces, or reads, into the correct part of the puzzle. And here the reference genome, the human genome or whatever your reference genome is, is kind of like the picture on the box that helps you figure out which piece goes where.
And then there is the possibility of imperfection in how the pieces fit together, relative to the reference or between multiple samples, and that can give you an idea about variation. If you don't have a reference genome, then your workflow looks totally different. I highly recommend you get one if you can. So, RNA-seq alignment challenges. It is still true to some extent that alignment can be computationally costly. You have the extra complexity, relative to DNA alignment, that you may have introns, so you may have spliced data, and you might have to make a choice about whether to align to the genome with a splice-aware aligner, or to create a spliced reference of the transcriptome and align directly to that. And you might wish that you could just align your data once and be done with it, but that's generally not how it works in bioinformatics or genome analysis. You'll typically find yourself doing alignment over and over again with slightly different parameters or different software depending on your goals. In terms of RNA-seq, historically there were three major alignment strategies. There was de novo assembly, which is really for the case where you don't have a reference genome or maybe a transcriptome, and you're trying to infer the transcript structures directly from your data. We have a module on this that we won't get to in this version of the course, but that we do in longer versions, which introduces some tools for transcript assembly. You can align to the transcriptome, or you can align to a reference genome with a splice-aware aligner. We're going to do the third approach, which is probably the most common approach now. Is the third strategy best? It sort of depends. As I mentioned, if you don't have a reference genome, you may have no choice but to do a de novo assembly approach.
There is also some possibility of complex variants or haplotypes that would be missed using a reference genome approach, where the reference genome itself can't fully represent that complexity; most non-graph-based reference genomes are just one single representation of an individual or a simulated individual. So in cases where there's complex structural variation, or a lot of small variation packed together, that might cause you problems that you could get around by using a de novo assembly approach. As for aligning to the transcriptome: if you have short reads, this used to be the preferred path, like if you had less than 50 base pair reads. That is not as common now; most places are generating 100+ base pair reads, so this is not as common of a strategy. It also relies on knowing the sequences of the transcripts, on having a known transcriptome. Aligning to the reference genome is kind of for all other situations, and, as I said, now that's most situations. It doesn't necessarily rely on known transcripts, although you can use known transcript structures to guide the process. It can still allow for discovery of new transcript structures through a reference-guided approach. There are multiple splice-aware aligners that allow you to do this, aligning directly to the reference genome. So, there are many aligners. We're going to introduce you to one or two or maybe three in this course; there are dozens of others. And this is kind of a history of various aligners. It's probably too small to read, but HISAT2, which we're going to use, is on there somewhere. So, should you use a splice-aware or unspliced aligner? We kind of talked about this. The fragments that we're sequencing represent mRNA with the introns removed, but we're usually aligning these reads back to the reference genome. Again, unless you have short, less than 50 base pair reads, you're probably going to use a splice-aware aligner.
Specifically in this course we're going to introduce the HISAT2 aligner. HISAT2 is one of these splice-aware aligners. It does require a reference genome. It's very fast, and it's a very robust, well-developed, and well-maintained tool, which is one of the reasons we chose to use it. It uses a somewhat complicated indexing scheme based on the Burrows-Wheeler transform and the FM-index, and it uses multiple types of indexes for alignment. We're going to walk through some examples, just conceptually, of how this works. There are several papers that describe the algorithm in all its gory details, which I encourage you to check out if you want to understand it at a deeper level. But basically, it builds these indexes: you create one genome-wide index, and then a large number of smaller local indexes, and then you do an iterative alignment strategy using those two kinds of indexes. It also has the option to account for known polymorphisms and known transcript structures, so you can feed it information about known exon-exon junctions and SNPs. As I mentioned, it uses this hierarchical indexing algorithm and then several adaptive strategies based on the position of a read with respect to splice sites; we're going to look through some examples of what I mean by that. It has a multi-step process where you first find candidate locations across the whole genome by mapping part of each read using the global index. That usually gives you one or a few candidate locations as you try to figure out where the read aligns, and then you switch to a local alignment using these local indexes. So it takes the genome, makes one big global index, and then makes about 48,000 local indexes, and uses both to come to a final alignment. And for paired-end reads, each mate is mapped separately.
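The two-level index idea can be sketched in miniature. This is a toy illustration, not HISAT2's actual implementation: the real tool builds FM-indexes over the Burrows-Wheeler transform, whereas here plain Python k-mer dictionaries stand in for the one global index and the many small local indexes (about 48,000 windows in the real tool; tiny numbers here for clarity).

```python
# Toy sketch of HISAT2's two-level indexing idea.
# Plain k-mer dictionaries stand in for the FM-indexes the real
# tool builds; window/k-mer sizes are shrunk for illustration.

def build_kmer_index(seq, k):
    """Map every k-mer in seq to the list of positions where it occurs."""
    index = {}
    for i in range(len(seq) - k + 1):
        index.setdefault(seq[i:i + k], []).append(i)
    return index

def build_local_indexes(genome, window=64, k=8):
    """One small index per fixed-size window of the genome
    (HISAT2 splits the genome into ~48,000 local indexes)."""
    return [
        (start, build_kmer_index(genome[start:start + window], k))
        for start in range(0, len(genome), window)
    ]

# A small repetitive "genome" (160 bp) for demonstration.
genome = "ACGTACGGTCCTAGGACTTAAGGCCAATTGCAATTCCGGA" * 4
global_index = build_kmer_index(genome, k=28)   # one genome-wide index
local_indexes = build_local_indexes(genome)      # many small local indexes
```

The global index answers "where in the whole genome does this 28-mer occur?", while each local index answers the same question for short k-mers within one small region, which is what makes the later local search tractable.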
And if a read fails to align, then the alignment of its mate can be used to anchor it and sort of recover that unaligned mate. So here are some examples to hopefully make this a little more clear. What we're looking at here are two exons from chromosome 22: there's an exon, a relatively small intron, and then another exon. And we're going to look through examples of how this alignment strategy works for three different kinds of reads: a read that is entirely contained within an exon, a read that slightly spans from one exon into another, and then another that is more evenly split across two exons. This first example is the simplest one. We align this first read, shown in red here, using the global index. This part is kind of the time-consuming part. What we're basically trying to do is find a sub-alignment: it looks to build an exact match of at least 28 bases. Once it finds a place where this 28 base pair sequence matches one or more locations in the global index, it first attempts an extension. Without even trying to get clever, it says: okay, I found this 28 base pair match somewhere in the genome using the global index; the first thing I'm going to do is just try to extend it. If the read belongs here, then the 29th base should be what I expect based on the reference, and the 30th base, and so on, so it just tries this extension step. And if it matches and matches, then great: I figured out where the read belonged using this 28 base pair segment, I extended, and it all looks good, so the read must belong here. That's the simplest possible situation. If the read happens to span an intron, of course, it gets a little more complicated. So again, for the second read, you start with this global search.
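The seed-then-extend step just described can be sketched roughly as follows. The function names and the dictionary-based index are illustrative stand-ins, not HISAT2's API: anchor the read by an exact match of its first 28 bases in the global index, then extend base by base until a mismatch or the end of the read.

```python
# Sketch of the first alignment phase: anchor a read with an exact
# 28 bp seed from the global index, then try a straight extension.
# Names and the dict-based index are illustrative, not HISAT2's API.

SEED = 28

def build_kmer_index(seq, k):
    index = {}
    for i in range(len(seq) - k + 1):
        index.setdefault(seq[i:i + k], []).append(i)
    return index

def seed_and_extend(read, genome, index):
    """Return (genome_pos, matched_length) for the best straight
    extension of the read's first SEED bases, or None if no seed hits."""
    hits = index.get(read[:SEED], [])
    best = None
    for pos in hits:
        # Extend past the seed until a mismatch or the end of the read.
        length = SEED
        while (length < len(read)
               and pos + length < len(genome)
               and genome[pos + length] == read[length]):
            length += 1
        if best is None or length > best[1]:
            best = (pos, length)
    return best

# Fully exonic read: the extension covers the whole read.
genome = "A" * 10 + "ACGTACGGTCCTAGGACTTAAGGCCAATTGCA" + "T" * 10
index = build_kmer_index(genome, SEED)
read = "ACGTACGGTCCTAGGACTTAAGGCCAATTGCA"
print(seed_and_extend(read, genome, index))  # → (10, 32)
```

When the matched length equals the read length, as here, the read is placed and we're done; the interesting cases below are when the extension stops short.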
For the second read, we have this problem where there's just a very short segment on the other side of the intron, so that's going to be a little more challenging. These 28 bases on the right end of the read again find a unique match in the global index. And then we again do this extension phase, shown in purple. So we're just checking to see whether the read matches at the place where we positioned ourselves using the global index search, and it works up until the 93rd base or something, but then you start getting mismatches: it hits this intron, and it's like, this is no longer the right place, there are mismatches. At that point it switches to the local index and attempts to align the remaining eight bases, so now it's looking for an eight base pair match in this local index. If you were looking genome-wide for these eight bases, you would probably have tens of thousands or even millions of matches, right? Eight base pairs is not that big of a sequence; it's not that unique. You'd have no hope. But because you anchored yourself using the global search and then the extension, you can search this much smaller local index, covering just one out of about 48,000 pieces of the genome, so there might be only one or two eight base pair matches in it. It finds a nearby match, completes the alignment, and creates this spliced alignment. It does some other checks to make sure everything makes sense in terms of the orientation of the match: if the eight base pairs matched but in the wrong direction, that would fail; it has to be eight base pairs in a consistent direction. Yeah.
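The spliced case can be sketched as well, again purely as an illustration with made-up sequences and function names rather than HISAT2's real data structures: extension from the global anchor stops when it hits the intron, and the short leftover tail is recovered with an 8-mer search in a small downstream region standing in for a local index.

```python
# Sketch of the spliced case: extension from the anchor stops at the
# intron; the short leftover tail is found with an 8-mer lookup in a
# small "local" region. Sequences and names are purely illustrative.

K_LOCAL = 8

exon1 = "ACGGTCCTAGGACTTAAGGCCAATTGCAGG"   # 30 bp
intron = "GTAAGT" + "C" * 20 + "AG"         # 28 bp, GT...AG splice motif
exon2 = "TTCAGGAC" + "GATTACAT" * 3         # 32 bp
genome = exon1 + intron + exon2
transcript = exon1 + exon2                   # the read comes from mRNA

read = transcript[22:58]  # 36 bp read spanning the exon1/exon2 junction

def align_spliced(read, genome, anchor_pos):
    """Extend from anchor_pos; place any leftover tail via an 8-mer
    search in a small region downstream of where extension stopped."""
    matched = 0
    while (matched < len(read)
           and anchor_pos + matched < len(genome)
           and genome[anchor_pos + matched] == read[matched]):
        matched += 1
    if matched == len(read):
        return [(anchor_pos, len(read))]         # no splice needed
    tail = read[matched:]
    # Search a small downstream region (a stand-in for a local index).
    region_start = anchor_pos + matched
    region = genome[region_start:region_start + 200]
    off = region.find(tail[:K_LOCAL])
    if off >= 0 and region[off:off + len(tail)] == tail:
        return [(anchor_pos, matched), (region_start + off, len(tail))]
    return None

print(align_spliced(read, genome, anchor_pos=22))  # → [(22, 8), (58, 28)]
```

The result is two aligned segments, one per exon, which together form the spliced alignment of the read across the intron.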
That's a good question. Yeah, so it would probably favor an alignment that matches up with the known exon, but depending on which mode you run the aligner in, it doesn't necessarily have to. In this case it matches perfectly with the expectation, with the exon boundary, but it doesn't have to, because with this aligner we can find novel splice junctions, for example, or novel exon-exon junctions, so it's not constrained to only alignments that are supported by known annotations. But the known annotations can be used to guide it. I think we just have one more example. Okay, so this is the example where there's more of an even split between the two exons. Again it does its global search, trying to find a match of at least 28 bases; this is the slow part, but once it finds that 28 base pair match, it attempts an extension. In this case, it can extend for about 50 bases until it starts mismatching at the 51st base. Then it switches to that local index, so now again we're looking for a small hit, like an eight base pair hit, in the local region, and we do find one nearby. And with that hit there's still more read to be explained, so now it switches again to extension mode, and if it can extend that eight base pair local index hit to cover the full read, it knows it has mapped this full alignment with about 50 base pairs on each end. So those are three scenarios of how this strategy works. It uses this logic and these heuristics, plus knowledge of known transcript structures, to come up with a best educated guess of where each read aligns, in a splice-aware manner. Yeah. Yeah, that's right. Yeah, it would find all the eight base pair matches, and if there's more than one, it would then switch to look for larger matches, I guess. That's a good question. Oh, sorry.
He asked what happens if the first 28 base pairs are split across a boundary, like if you can't make that first 28 base pair match. I forget what it does in that case; I think the default global search would fail. Let me research that to remind myself exactly what it does. I'm sure it has some fallback strategy, where it either shifts the 28 base pair window over so that it can find a 28 base pair match elsewhere in the read and then also extends backwards, or something like that. That's a good question though. Yeah, trying from the other end of the read seems like a logical thing to try; that could be another answer to the question about the first 28 base pairs. I just forget the details of specifically what it does, but it's probably something like that. Yes: it does this for each read separately at first, but then if one read fails, it uses the other read to anchor it and retries the failed read. That's right. Yeah, that's a good point too. I don't remember reading that anywhere, but I do imagine that at some point, for very short reads, it would start to struggle, like 30 base pair reads or something, which would be pretty antiquated now. I think these are default behaviors, and I believe there is probably a certain amount of tuning you can do to change the default way this works for a different situation, like unusually short reads or something. Right, with a transcriptome alignment the reads fall within the exons, not crossing the junctions. That's true, I suppose. Which sort of solves the problem in one sense: it solves the problem of correctly placing the read, but then it hurts you when it comes to trying to actually resolve the transcript structures. Yeah.
So what do you get out of HISAT2? You get a SAM or BAM file. SAM stands for Sequence Alignment/Map format, and BAM is just the binary version of a SAM file. So a BAM file is basically a SAM file; the difference is that the SAM file is plain text, so you can actually open it and read it, while BAM is a compressed version. BAM is to SAM what fastq.gz is to FASTQ: an unreadable compressed version where you need special software to interact with it. And it reduces the size of the files quite substantially; I want to say it's something like one tenth the size of a SAM file. Really, nobody does anything with SAM files at this point; they're way too big. Everything's either BAM or CRAM. Well, a lot of things are CRAM now, which is the next, even more efficient compression strategy for BAM files.
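For a concrete look at what's inside, a SAM alignment line has 11 mandatory tab-separated fields defined by the SAM/BAM specification, and a minimal parser might look like the sketch below (the example record is made up; BAM itself is binary and block-compressed, so in practice you'd read it with samtools or pysam rather than by hand):

```python
# A SAM record is one tab-separated line with 11 mandatory fields
# (per the SAM/BAM specification). This minimal parser ignores any
# optional tags after the 11th column; the example record is made up.

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ",
              "CIGAR", "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_line(line):
    """Parse one SAM alignment line into a dict of its 11 fields."""
    cols = line.rstrip("\n").split("\t")
    rec = dict(zip(SAM_FIELDS, cols[:11]))
    for f in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        rec[f] = int(rec[f])  # integer-valued columns
    return rec

# A hypothetical spliced alignment: 8 bases, a 100 bp intron (N),
# then 28 more bases, matching the exon-spanning example above.
line = ("read1\t0\tchr22\t1000\t60\t8M100N28M\t*\t0\t0\t"
        "ACGTACGTACGTACGTACGTACGTACGTACGTACGT\t" + "I" * 36)
rec = parse_sam_line(line)
print(rec["RNAME"], rec["POS"], rec["CIGAR"])  # → chr22 1000 8M100N28M
```

Note how the splice shows up in the CIGAR string: the `N` operation marks a skipped region of the reference, which is how a splice-aware aligner records an intron.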