So yesterday, what we did is we looked at the methods for generating sequencing data. We mainly focused on Illumina data, but we also briefly looked at long-read sequencers. And of course, you did the quality control on a dataset that wasn't very good, and that you could improve quite well with adapter trimming and base quality trimming. So after you have done the quality control on your sequencing reads, so on your FASTQ files, the usual next step, depending a little bit on the application, would be read alignment against a reference genome. That's what this presentation is about, and that's what you'll later do in the exercises.

So what you have seen is a FASTQ file, a raw one and a trimmed one, let's say. And what you then usually have is a reference genome, a reference sequence. How good the quality of that reference sequence actually is depends very much on the organism. For human and mouse, for Arabidopsis, for Drosophila, for example, you can expect that reference sequence to be of relatively high quality, although there are always gaps in there, so missing pieces. But you can assume it is quite complete. For other organisms where there is only a very basic assembly, you'd have to make do with a less good reference genome, and that can have implications for the read alignment and for the downstream analyses.

Anyway, we assume you have a reference genome, and what an aligner does is, for each read, try to find the most likely position on the reference genome where it could have, let's say, originated from in the sample you have actually been sequencing. So an aligner finds, for each of these reads, a position on the reference genome, and that information is stored in a SAM file. The SAM file is yet another very well-known file format used in bioinformatics; if you compress it, it's called a BAM file, more on that later on. What's stored there are the positions where the reads align to the reference genome, and the aligner is there to generate that SAM file.

This is how a SAM file looks if you visualize it. I think at least some of you, if not all of you, have seen such a visualization before; this one in particular is from IGV. What you see over here are paired-end reads, where the forward and reverse reads are linked by a small line. That small line is the part of the fragment that was not sequenced, but that most likely also aligns to the reference genome. You also see some colors there; those are mismatches between the read and the reference genome. All that information, where a read aligned and where the mismatches are, is stored in that SAM file, and it's generated by the aligner.

So, how do these aligners work? In principle, it's pretty basic, pretty simple. The aim is to find substrings, which would be the sequencing reads, in a large string, which would be the reference genome. In many cases those might be exact matches, where the read aligns precisely to the reference genome, but in many other cases it is not an exact match but a nearly exact match. However, typically we're looking for millions of substrings, if not hundreds of millions, sometimes even billions, in a very large string that can be tens of millions of characters long, or even more.
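To get a feel for why this is a challenge, here is a minimal Python sketch of the naive approach, just sliding the read along a toy reference and comparing at every offset; the sequences are made up for illustration:

```python
def naive_find(read: str, reference: str) -> list[int]:
    """Return every 0-based position where the read matches exactly."""
    hits = []
    # Compare the read against every possible offset in the reference.
    for start in range(len(reference) - len(read) + 1):
        if reference[start:start + len(read)] == read:
            hits.append(start)
    return hits

print(naive_find("ATA", "TAATA"))  # [2]
```

Per read this costs on the order of (genome length × read length) character comparisons, and multiplied by millions of reads it becomes hopeless, which is exactly what motivates the indexing described next.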
So, that is a computational challenge. If you would just do that by hand, so to speak, you would take your read and try to match it at every possible place on the reference genome and see where it matches. Of course, if you want to do that for millions of reads against a reference of tens of millions of characters, that will take literally forever. So, we need to do something to speed this up.

One way to do this, and what's very often done in bioinformatics, is something called indexing. Indexing is done on many large files in bioinformatics, and you can of course also index a reference genome. Basically, what you do is generate a kind of phone book of your reference genome in order to enable fast searches, so it doesn't take forever to find that particular match of a read in the reference genome.

And that is often done like this, at least as a starting point. You take the entire reference genome and repeatedly chop one character off the start, collecting all the suffixes. So, let's say we have a reference like this: TAATA, followed by a dollar sign. Usually your reference genome is much, much bigger, but this is a very small reference genome of only five base pairs, for simplicity. We take all the shorter suffixes of this reference genome and put them in a table: we have TAATA$, the full reference genome; we take off the T and add AATA$; and so on and so on. You can already imagine that if you are searching in here, you can find some substrings if you're looking for them. However, you make this more efficient by alphabetically ordering this matrix, ordering by nucleotide. The dollar sign, by the way, I should have mentioned that it always marks the end, and that can be relevant later on; that's also relevant for the suffix array. Alphabetically, the dollar sign always comes first. So, we alphabetically order our suffixes, and then we get something that is called a suffix array. If you have worked with bioinformatic algorithms before, then you have probably heard of a suffix array, because it is a concept that is used quite frequently.

You can already see from the suffix array that searching probably goes pretty fast. For example, if you have a read starting with an A, you only have to search in the second, third, and fourth rows in this case. So that already goes a bit faster than going through the entire reference genome. What you can also see is that, although we have a relatively small reference, we store this reference almost three times over in the suffix array. So the suffix array can become pretty big.

Okay, so apparently the animations aren't that great. Daniela has a question.

Yes, I was wondering, this is for whole genome sequencing, right? In my case, I do amplicon sequencing; do I still use the whole genome as a reference, or the amplicons of my wild-type organism? Because then the amplicons are the length of the reads, right? So it wouldn't need to find where it is.

Yeah. Well, in principle, it will be relatively easier, because you probably also have far fewer reads and you know where you want to align your reads. However, you do, of course, a PCR on your entire genome, on your entire DNA.
So, there can be, for example, amplification of regions that are similar to your target region and that are still part of the genome, and sometimes that information is relevant to know. In principle, what you could do is just take your target region as a reference and align your amplicon sequencing reads to that. However, that would also mean that you would force a read that actually comes from a different region in the genome onto your target region, even though it might not originate from there. So, even for amplicon sequencing, I would say it's worthwhile to take the entire reference genome, align your reads against that reference genome, and then take only the part that you're interested in for variant calling, and of course also check where reads end up that are not in your target region, just as quality control, to see whether you haven't been amplifying parts of the genome that you are not interested in.

Okay, thank you. All right, you're welcome.

So, I am going to give you the full slide, because the animations are a little bit messed up; sorry about that. If you query this suffix array, that is much faster than just scanning the reference genome as a whole, because what you can do is a binary search, and binary search is much faster than a linear search, where you just go through the entire reference genome. So, let's say you have a query, ATA. We can already see that it fits directly over here, at this row, but of course the array is usually much larger. What you do is start in the middle of the suffix array and check where your query has to be, in the upper part or the lower part. We see that it has to be in the lower part, because the middle row starts with AA, and ATA comes after that alphabetically. Then you split again, up or down; we see we have to go up, and we directly have our match. With a bigger reference genome, and therefore a much larger suffix array, you usually need more steps in your binary search, but binary search is way faster than a linear search over the entire genome. So, that's a step forward: we do now have a suffix array, which is much larger than our reference genome, but our searching is much faster. That's already quite nice.

However, we have this huge suffix array, and for big genomes that can really go up to 100 gigabytes. There is a solution for that, and it's a very elegant solution: the Burrows-Wheeler transform. Our suffix array is very large because we store the same sequence multiple times, but with the Burrows-Wheeler transform you only store the first and the last column. In principle, what you do is take the suffix array, but include for each suffix the sequence that comes before it, wrapping around; so you kind of circularize your reference genome. If we look at, for example, the third line, where we took the first T off, the rotation still contains our full reference genome, TAATA$, just circularly shifted, which means that line would still have the T, now at the end. In each line, basically, you store the entire reference genome, rotated, and you do that for the entire suffix array. So, you have this Burrows-Wheeler matrix over here, but that's not yet the actual Burrows-Wheeler transform.
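Here is a minimal Python sketch of everything so far, assuming the toy reference TAATA$ from the slides: it builds the sorted suffix array, binary-searches it for the query ATA, and then builds the sorted rotation matrix, of which the Burrows-Wheeler transform keeps only the first and last columns:

```python
from bisect import bisect_left

ref = "TAATA$"  # '$' marks the end and sorts before A, C, G, T

# Suffix array: every suffix of the reference, alphabetically sorted.
suffixes = sorted(ref[i:] for i in range(len(ref)))
# ['$', 'A$', 'AATA$', 'ATA$', 'TA$', 'TAATA$']

# Binary search: find the first suffix that is >= the query, then
# check whether it starts with the query.
query = "ATA"
i = bisect_left(suffixes, query)
print(suffixes[i].startswith(query))  # True: the read occurs in ref

# Burrows-Wheeler: sort all rotations of the circularized reference.
rotations = sorted(ref[k:] + ref[:k] for k in range(len(ref)))
first_col = "".join(r[0] for r in rotations)   # '$AAATT' (just sorted ref)
last_col  = "".join(r[-1] for r in rotations)  # 'ATTAA$' (the BWT itself)
print(first_col, last_col)
```

Real aligners never materialize this matrix; it exists only conceptually, and the last column plus a few small auxiliary tables (the FM-index) is enough to both reconstruct the reference and search it.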
So, this matrix would of course be, again, huge; it would be, I think, quadratic in the size of the reference genome. But what's nice about the Burrows-Wheeler transform is that if you only store the first column and the last column, you can always reconstruct your full reference genome, and you can still do these binary searches. What we have over here is basically the first column, which, as you can already see, is just the alphabetical ordering of all the A's, C's, G's and T's in your reference genome. The last column is something different; it's quite a specific order. But with this first and this last column you can completely reconstruct your reference genome, and you can perform these binary searches directly on them. How that exactly works would take a little too long to explain, but you can take it from me that just storing this first and this last column, which is called the Burrows-Wheeler transform, is enough, and that concept is used by most of the frequently used aligners. Is the concept at least a little bit clear to you? Yeah, so this step from the Burrows-Wheeler transform to binary searching is still a bit of a leap, and it requires quite a few more steps to really understand what's going on. Nice to know is that the Burrows-Wheeler transform is usually smaller than, or at most about as big as, the reference genome, so you do not require a lot of disk space to store it. So much for that.

So, about aligners in general: you can use global or local alignments. With global alignment you try to match the entire read to the reference genome, always. If it does not align completely, you get penalties, meaning, for example, a reduction in the score for how likely it is that the read actually aligns at that position. And there is local alignment. A local alignment tries to align only part of your read. That means it allows for clipping: the aligner aligns, for example, only the middle part of the read to the reference genome and simply ignores the outer parts of the read. Both have their advantages and disadvantages, but you typically use global or local alignment in different situations, depending mostly on your organism, for example. You can imagine that if a lot of mismatches, insertions, or deletions are expected between your reads and the reference genome, for example if you're aligning your reads to a species related to the sample you're actually sequencing, then it's often a better idea to use local alignment. If the reference genome is very similar to the reads you're actually sequencing, you usually do not want this soft clipping to happen, because it might lead to spurious alignments, alignments that are not real and are not supposed to be there, and therefore you might choose global alignment.

I have a question for you, and I think it's related to this, or it should be at least. The question is actually kind of what I just said: suppose you want to align reads to a reference; in which cases would you use a global aligner, and when would you use a local aligner? So, local would allow for this clipping, and global would try to align the entire read, full stop. And most of you are correct, I think: global alignment if query and reference are similar, and local if dissimilar. That's typically how you would use them.
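In practice, switching between the two modes is a single flag in the common aligners. A sketch, assuming bowtie2 is installed, an index named ref_index was built beforehand with bowtie2-build, and the read file name is a placeholder:

```python
import subprocess

index = "ref_index"            # basename of a bowtie2 index built earlier
reads = "reads_trimmed.fastq"  # placeholder: your trimmed reads

# Global (end-to-end) alignment: the whole read must align; mismatches
# and gaps cost score, and no clipping is allowed. Bowtie2's default.
subprocess.run(["bowtie2", "--end-to-end", "-x", index, "-U", reads,
                "-S", "aln_global.sam"], check=True)

# Local alignment: read ends may be soft-clipped when that improves
# the alignment score.
subprocess.run(["bowtie2", "--local", "-x", index, "-U", reads,
                "-S", "aln_local.sam"], check=True)
```

For BWA the defaults are mirrored: bwa mem ref.fa reads.fastq > aln.sam performs local alignment with soft clipping out of the box.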
Of course, there are also use cases where you are only interested in a specific part of the genome that, for example, has a lot of indels; then you might also consider local alignment, for instance if the indels are very large or your reads are very small, or the other way around, global, if you really want to align the entire read.

So, software: how do you actually perform these alignments in practice? For, let's say, basic alignments, meaning for example genomic reads to genomic sequences, the most popular tools are still Bowtie2 and BWA-MEM. They're both based on exactly this Burrows-Wheeler transform, as I just tried to explain it to you. However, they are slightly different, and the difference is mainly in the default settings. If you use Bowtie2, the default is a global alignment, meaning that it tries to align the full read to the reference genome, while for BWA-MEM the default is a local alignment. And for both of them you can switch between local and global: you can have local alignment with Bowtie2 by just changing the settings, and the same counts for BWA. And usually, if you change the settings accordingly, they give very similar results in terms of alignment.

Then you also have splice-aware aligners. If you work with RNA-seq data, then you have probably heard of, for example, HISAT2, which is based on a very similar algorithm and is also developed by the same authors as Bowtie2, and STAR; you have probably heard of STAR before. They're both splice-aware aligners, which means that you can align reads that are generated from transcripts to a genome, across splice sites. If you work with bacteria, that's not very relevant, and you can always use Bowtie2 and BWA. However, if you work with a eukaryote, then of course you have splicing, so you have intronic sequences in your reference genome, and if you are going to align your RNA-seq reads to your reference genome, you have to take these introns into account. Basically, with splice-aware aligners, you have your messenger RNA, you generate reads from that messenger RNA, and then you align those reads to the reference genome with the introns still in there, so with splice sites in there. If you used Bowtie2 or BWA to align these reads to genomic DNA, they would just accumulate a lot of penalties, and therefore a lot of problems, aligning around these intronic regions, while HISAT2 and STAR take the intronic regions into account and will not give a penalty for introns.

Then, a somewhat more modern aligner that's also a bit more versatile than the above is minimap2. It is gaining quite a bit of popularity. It was initially developed mainly to be able to properly align long reads, so reads coming from PacBio or Oxford Nanopore technology, but you can also align short reads with it, and it can also be used for splice-aware alignment. So you can do long-read splice-aware alignment and short-read splice-aware alignment with minimap2; its index is built on a somewhat different, a bit more sophisticated, principle.

Then a few words about mapping quality. What happens very often is, let's say you have a genome and you have two regions on that genome that are very similar to each other, for example copies of a transposable element, or a recent gene duplication.
So, let's say these blue parts are very similar to each other and the green part is genuinely unique. It's easy to align reads to the green part, but if you have a read from a blue part, you are not entirely sure whether to align it at the first blue part or at the second blue part, because the read can come from both of those regions, and that uncertainty is captured in the mapping quality. The mapping quality says something about how sure the aligner was that the reported position in the genome is the correct position. The nice thing is that this mapping quality is calculated in the same way as base quality, so it also uses the Phred score, but now not on the probability that the base is wrong, but on the probability that the mapping position is wrong. So the aligner calculates the probability that the mapping position is wrong and applies the Phred transformation to give you a score. If the aligner is 99% sure that the mapping position is correct, then the probability that the mapping position is wrong is 0.01, and therefore you have a mapping quality of 20. High mapping quality means good mapping, so the aligner was sure that the read actually belongs there. A low mapping quality, for example a mapping quality of 3, means there is a 50% chance that the mapping position is wrong. That would, for example, be the case in the example above, where you're not sure where the blue reads are coming from; you get a low mapping quality of 3, and often it is even 0. Then the aligner was completely unsure where the read should belong, so it gives it a more or less random position, and that can of course cause problems if you are, for example, interested in variant analysis, or if you are doing read counting for RNA-seq analysis.

That's it for this presentation. Are there any further questions regarding read alignment? Sofia has a question.

Yes, so what will happen in the event that you have a sequence that can actually come from two different genomic locations? Is it just mapped to both, and do they tell you that this is not very sure, or can you disregard them, or how does it work?

So, that would indeed be exactly this situation. It's a setting, in both Bowtie2 and BWA, what the aligner will do with such a read, but usually it will just give it a place; not a random place over the entire genome, but it will align it to one of those positions and then give it a mapping quality of zero. So it will align it, but it will say: I'm going to align it here, but I'm completely unsure whether it should be here.

Okay, so let's say there are two possibilities for the alignment of the sequence; it will align it to one of them and tell you that it's not very sure. Will it also tell you what the other position could be, or not?

Yes, I think you can have a look at both; that would be the secondary alignment. And then the position is also stored in the BAM file. Okay, good, and I think Bowtie2 does just that. Thanks.

All right, I have one more question. Yes, sorry to repeat it, but I'm not clear on why we index the reference genome. Okay, yeah, can you explain it a bit?

Yeah, of course. So let's take this example. We have ATA as the query over here, so a read of three nucleotides, and we have the full reference over here. In principle, computers are stupid, right? We have to tell them exactly what to do.
So, a very, very naive way to find the most likely position of this query in the reference would be to just start at the beginning of the reference and see whether you have a match. We have ATA, and we look for a match with the first three nucleotides, TAA: no, no match. Then we continue to the next position and check again, AAT: not correct. Next one, and so on. At the third position we see a match. That means we have made three comparisons over here, right? Against TAA, AAT, and ATA. You can imagine that with a very long genome and a lot of reads, that can take quite a lot of time; here we compared the read to the genome three times.

If you have a suffix array, and in this case, sorry, my mouse is a bit sensitive, if you have a suffix array, we actually do only two comparisons: we take the middle and check, is there a match, yes or no? Then we say, okay, we have to go lower than this, so we take the middle of that part, check again, and then we have our match. So we do only two comparisons in this case. As your genome increases in size, the first, naive way we took would scale linearly with the genome size, while a binary search does not; it scales only logarithmically, so you usually have only a quite limited number of comparisons to do. That's why making an index, a suffix array in this case, and later on the Burrows-Wheeler transform, which is basically a modified suffix array, makes your searches faster.

And then each and every read will go through that entire process against the whole reference genome like that? Yep. Okay, yes, thank you. All right, Michael.

About the choice of the aligner software: is it influenced by whether you use it for eukaryotes or prokaryotes?

I am not 100% sure what the most frequently used aligner for prokaryotes is; for eukaryotes, for genomic alignments, you're usually using Bowtie2, but BWA works similarly. I think the choice you would have to make between Bowtie2 and BWA, for example, would be similar to the choice you would make for any other organism, meaning that if you expect a lot of mismatches, then it's probably best to take a local aligner, or at least an aligner that can run in local mode. If you expect mostly complete matches, and I think that's often the case in prokaryotic genomes, then you can take a global aligner. Sabrina.

Yes, I was wondering about the reference genome. I know that there are different versions; which one would you choose, and what's the difference? There are always new versions and all that, so how do you decide which one to use?

So, in terms of reference genomes: the consortia that are responsible for reference genomes usually only change a reference genome if there's really a significant improvement, because changing a reference genome has a huge impact on running projects, meaning that all coordinates change and you would need to change all the derived files. If you have alignment files, you would need to change them, maybe redo the alignment; if you have VCF files, those would need to change too. So usually, if you start a new project, it's a good idea to take the newest version, because the newest usually has significant improvements compared to the previous reference genome.

Okay, so there's never a good reason to take an old version; always the newest one, right?
Well, if you want to compare to previous results, for example, of course there is; or if the entire project was based on an older reference genome and in the meantime a newer one appeared, then it can be a good idea to stay with the old one. The annotation, on the other hand, so if you are for example interested in specific genes, does change quite often, and there doesn't need to be a very significant improvement before an annotation version changes, because that usually doesn't have a big effect on downstream analyses: the genomic coordinates do not change. It's only, for example, whether you assign this read to gene A, or whether a gene is renamed, or whether there is an additional isoform of a gene. So usually that doesn't have a huge impact on downstream analysis. Thank you.

All right, no more questions, great. I see something in the chat: "they're not exotic organisms". No, I was talking about exotic plants; all the plants I've worked with have been very exotic, so that's why I was mentioning it. Nice.

All right, let's go have a look at the exercises. What I would like to ask you to do is start with the exercises regarding read alignment. You will download an E. coli reference genome using esearch and efetch, so the Entrez E-utilities software, then make an index and check out how the files that actually represent the Burrows-Wheeler transform look, and then actually align the reads with Bowtie2. So in this case we're using a prokaryotic organism and we're taking a global aligner; Bowtie2 aligns globally by default.
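For orientation, here is a rough Python sketch of the same steps, with Biopython's Bio.Entrez standing in for the esearch/efetch command-line tools the exercises use; the query term and file names are placeholders, and it assumes bowtie2 and bowtie2-build are installed. The tail end tallies the mapping qualities discussed earlier:

```python
import subprocess
from collections import Counter

from Bio import Entrez  # Biopython's interface to the Entrez E-utilities

Entrez.email = "you@example.org"  # NCBI asks for a contact address

# Placeholder query; use the accession given in the exercise instead.
handle = Entrez.esearch(db="nucleotide", term="YOUR_ECOLI_ACCESSION")
ids = Entrez.read(handle)["IdList"]
handle.close()

# efetch the genome as FASTA and write it to disk.
handle = Entrez.efetch(db="nucleotide", id=ids[0],
                       rettype="fasta", retmode="text")
with open("ecoli.fa", "w") as out:
    out.write(handle.read())
handle.close()

# Build the index (the .bt2 files hold the Burrows-Wheeler transform of
# the reference) and align; Bowtie2 is global (end-to-end) by default.
subprocess.run(["bowtie2-build", "ecoli.fa", "ecoli_index"], check=True)
subprocess.run(["bowtie2", "-x", "ecoli_index", "-U", "reads.fastq",
                "-S", "aln.sam"], check=True)

# Mapping quality is column 5 of each SAM alignment line:
# MAPQ = -10 * log10(P(position is wrong)), so 20 ~ 99% sure, 0 ~ unsure.
mapq_counts = Counter()
with open("aln.sam") as sam:
    for line in sam:
        if not line.startswith("@"):  # skip header lines
            mapq_counts[int(line.split("\t")[4])] += 1
print(mapq_counts.most_common())
```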