Okay, so my name is Matei David. I'm trained as a computer scientist, and I'm working at OICR in computational biology, in the lab of Jared Simpson, who will be speaking later today. We develop algorithms for assembling sequencing data and for putting together data from different sequencing technologies, like Illumina and Pacific Biosciences. Today I'm going to talk about reference alignment. Okay, so the objectives are, in general, to understand the problem of alignment. I'll be talking in particular about Illumina sequencing data: Illumina data has certain types of errors and certain characteristics, so I'll be speaking about those and not so much about other sequencing technologies. We'll learn some terminology, we'll see the common file formats that you'll have to deal with, and during the lab we'll actually run some alignments. By the way, I consider alignment and mapping to be equivalent things, so read mapping and read alignment are the same thing. I use the term aligner throughout the presentation, but a read mapper is just the same thing. Okay, so let's talk about Illumina sequencing first. There are all sorts of sequencing technologies out there, and they're evolving. Some technologies emphasize throughput, gigabases per run; others try to maximize read length; and there's a whole picture of trade-offs in between. We'll be talking about Illumina today, so I'll give you a very brief description of how it works. Bear in mind, I'm a computer scientist, so some of you probably know this better than I do. With that in mind, I only need to describe enough to touch upon the sources of errors and how those impact read mapping.
We start with genomic DNA that is fragmented into small fragments, 200 to 300, or maybe up to 500 or 600 base pairs. Adapters are attached to those fragments, and then they are put on a flow cell. On the flow cell, PCR amplification takes place: individual DNA molecules, represented by this single strand here, are amplified into clusters, which are these guys over here, in such a way that, ideally, all the molecules in the same cluster are identical. They're not always the same, because there are all sorts of errors, but that's the ideal, okay? After that, the slides are placed into the sequencer and the DNA bases are read by what's called sequencing by synthesis. We have a single strand of DNA hanging, and individual bases labeled with a fluorescent molecule are added. So here the template base is an A, so a T comes in because it matches, and it's attached here; a light is applied to the entire plate, and a light emission is produced, right? That's what we see here. The process continues base by base: in the second round another base is added, and so on. This goes on for all the clusters across the entire plate at the same time, so what the sequencer sees is these nice colored dots. Base calling is the process of translating these images, which are seen by the sequencer, into reads. The machine sees this and produces letters, which are these guys over here, and it also estimates the error for every single letter: it gives an estimate of the probability that the base it produced is not the real one, and that has to do with the colors it sees. Okay, so what I want to get to is the types of errors that are specific to Illumina sequencing. In panel A here, we have the ideal scenario.
We have a cluster with three molecules and the sequencing is all in sync: the red-labeled base is being added to all three molecules, the light shines, all of them emit red, and this produces a nice, clear red color which is read by the sequencer. Okay, so that's the ideal. Panel B is a case where, even though the molecules in the cluster are the same, they are dephased. The letters are the same, but this process of adding a base, shining a laser, then adding another base and shining the laser again, has gotten out of phase. So this molecule is now shining a red light, this one here skipped the red base and is shining a blue light, and this one, maybe on the wrong base, is shining a green light. The sequencer images the slide and, in the zone of this particular cluster, it sees some color which is a mix of all these colors. So it will not be sure which color is the real one; it will probably choose one of them, and ideally it should say, well, I'm not so sure about this particular base. Panel C is a loss of signal. Here, initially, there were enough molecules, but now some of them have broken off, so only a few are left, and the signal might not have as much intensity as the sequencer expects; the sequencer might be confused by that, so there might be errors. And finally, panel D is, I think, a generic case: maybe the molecules being added are erroneous in some way, so they're shining the wrong light. The point here is that Illumina sequencing mainly has substitution errors: as you go along the read, certain bases will be wrong, as opposed to having insertion or deletion errors, okay?
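To make the substitution-only error model concrete, here is a toy simulation in Python. The function name and the error rate are made up purely for illustration, and real Illumina errors are not uniform along the read (as discussed next, they increase towards the end):

```python
import random

def add_substitution_errors(read, per_base_error=0.005, seed=0):
    """Toy Illumina-style error model: substitutions only, no indels.

    Each base is independently replaced, with probability per_base_error,
    by one of the other three bases. Read length never changes, which is
    the key property of a substitution-only model.
    """
    rng = random.Random(seed)
    out = []
    for base in read:
        if rng.random() < per_base_error:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)
```

Note that the output always has the same length as the input; that is exactly what makes substitution errors easier for aligners to handle than the indel-rich errors of long-read technologies.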
Other technologies have their own processes, and because of that, they have their own error profiles. In Illumina there are several stages: first, a base is added with its fluorescent label; then the fluorescent labels are cleaved off, so that in the following cycle the next base can be added; and so on. I think in Illumina the clusters are sequenced from the overhanging top down towards the plate where they're attached. Just because of this process, you can understand that there's more chance for the bases at the end of the read to be of poor quality and have more errors than the ones at the beginning. And the reason is phasing: by the end of the read, it's more likely that at some point before, the various molecules in a cluster have gotten out of sync. So those are two characteristics of Illumina: we have substitution errors, and the bases at the end of the reads are of poorer quality than the ones at the beginning. [A question from the audience.] The question was whether the phasing, this dephasing process, increases towards the end of the read. I'm not sure, but it doesn't have to: even if the per-cycle chance of dephasing is constant, at the beginning you have only gone through, say, three cycles that all had to stay in sync, while by the end of the read you have gone through a hundred cycles, or whatever the read length is. So even if dephasing happens at a constant rate per cycle, the problem accumulates towards the end of the read. [Another question.] So, there are of course errors during the reading, and there are also errors during cluster generation.
You might be able to see those in your files: if a cluster is really bad and the sequencer is really confused about what's going on with that particular molecule, the internal quality control might just discard it. But yes, there are errors during reading and errors during amplification. Okay, so this plot shows base position within an Illumina read, going from zero up to a hundred; the newer Illumina read lengths are a bit longer, up to a hundred and fifty or two hundred and fifty, and this figure is from a few years ago. The various curves are different organisms, and I don't know which one is which, but the way I'm used to thinking about human sequencing is like one of these curves: the quality is really good at the beginning of the read and then starts falling towards the end. That's why, during the lab, we're actually going to do some read trimming based on this quality. The summary is that Illumina has a low error rate, on the order of a fraction of a percent of bases being erroneous, and this differs from other technologies, where these entire discussions just don't apply: the long-read technologies have much higher error rates, and they make very different kinds of errors, mostly insertions and deletions, with some substitutions but not as many. So, paired-end reads. This is another term you're going to be hearing about. Illumina sequencing produces paired-end reads. What this means is that the genomic DNA is fragmented, the adapters are ligated, and these clusters are built on the flow cell, which looks like this. And every one of those fragments is sequenced twice: first from sequencing primer one towards primer two.
So the fragments are sequenced once from one end, from the overhanging end towards the flow cell; then cluster generation is repeated, and the molecules are sequenced from the opposite side, from sequencing primer two towards primer one. For that reason, every one of these DNA molecules gets read twice, and the two reads point towards each other within that fragment of 300 or 500 base pairs, right? So that's paired-end. The alternative to that is what's called mate-pair sequencing, and the reason for it is that when you're trying to assemble a genome, if you only have information from fragments up to 500 base pairs long, that might not be sufficient to resolve repeats which are longer than this length. Usually, when you're trying to assemble a genome, you want information with longer range. So there's this other protocol, mate-pair sequencing. The genomic DNA is fragmented, but now the fragments are, let's say, 2 to 5 kilobases or even more. The ends of these fragments are labeled with biotin, then the fragments are circularized, and these circles, which are 2 to 5 kilobases or more around, are fragmented back down to 400 to 600 bases. Out of this whole mixture, if you look at these circles, the biotin label occurs at just one point, so the fragments which are not biotinylated are washed off, and at the end you pick the fragments of 400 to 600 base pairs which are labeled with biotin. These are fed back into the sequencing workflow at this stage, and from here you do the same thing as before. What happens is that, because of this process, the two read arrows from such a fragment point towards the central biotin label, and if you undo the circularization, you get paired reads which come from these large fragments but point outward.
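Once both reads of a pair are mapped, the library type shows up in their relative orientation. Here is a minimal sketch, assuming we already know the mapped position and strand of each read; the function name and the labels are hypothetical, not taken from any particular tool:

```python
def pair_orientation(pos1, strand1, pos2, strand2):
    """Classify a read pair mapped to the same chromosome.

    strand is '+' or '-'; positions are leftmost mapping coordinates.
    'FR' (reads pointing towards each other) is the usual paired-end
    orientation; 'RF' (pointing away from each other) is what the
    mate-pair protocol described above produces.
    """
    # Order the two reads so 'left' has the smaller coordinate.
    (lpos, lstr), (rpos, rstr) = sorted([(pos1, strand1), (pos2, strand2)])
    if lstr == '+' and rstr == '-':
        return 'FR'   # innie: paired-end
    if lstr == '-' and rstr == '+':
        return 'RF'   # outie: mate-pair
    return 'tandem'   # both on the same strand: unexpected
```

This is essentially the check aligners perform when they auto-detect the library type from a sample of mapped pairs.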
So here you're going to get a read pointing that way, and here you're going to get a read pointing this way, but the point is that the two reads which come from the same fragment will be much farther apart on the genome; the distance will be those 2 to 5 kilobases. I don't know exactly how this works chemically; I just know that the machine produces two reads, and from a mapping point of view, what I know is how they should map on the genome. The machine probably has some internal way to detect and skip the biotin junction so that a read doesn't run through it to the other side. I'm just guessing, because during this step, when you have these circular molecules and you fragment them, you can't really guarantee that the biotin will be in the middle of the fragment, so the junction might appear within a read. Okay, so when we talk about paired-end and mate-pair, this is what that terminology means. And when we go to read mapping, the aligner will need to know what type of reads it is mapping to the genome. These days, the aligners we're going to use know how to detect this kind of thing: you give them a bunch of reads in a FASTQ file, and they will work out whether the two arrows point towards each other at a small distance or away from each other. But it's good that you know what this is all about. Okay, so the FASTQ format. This is the format in which reads are stored in files. This is what an entry for one read looks like: we have a label over here, and this is the read sequence, which is the bases.
Then this is a plus line, which is a comment, or actually an ID or label for the quality string which follows. We're going to see those more closely in the lab. Within a file, entries are consecutive: this is the entry for one read, this is another read, another read, and so on. Right, so the base quality scores on this fourth line: each one is an integer, and it's a way to represent the probability that that specific base is an error. Given the probability P that the base is wrong, it gets associated with an integer base quality Q by the conversion Q = -10 log10(P), so minus 10 times the log base 10 of the error probability. This formula translates a floating-point probability of error into an integer, which conveniently ends up between 0 and 40 or 60. These integers are then encoded in FASTQ files by a scheme which is usually Phred+33, meaning that a quality Q is encoded as the character with ASCII code Q + 33; there was also another scheme called Phred+64, used for older data. There's a nice diagram on the FASTQ Wikipedia page: it lists the ASCII characters in order, starting from code 33 and going up, and shows what the quality value is for each character under these schemes. These days, even the recent Illumina machines produce reads where the qualities are encoded as Phred+33, meaning we're talking about the range from the exclamation mark up to capital I.
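The quality arithmetic above, plus the 3' quality trimming mentioned earlier for the lab, can be sketched like this. It's a simplified illustration with made-up function names; real trimmers use more robust criteria than a single fixed threshold:

```python
import math

def error_prob_to_phred(p):
    """Q = -10 * log10(P): probability that a base is wrong -> integer quality."""
    return round(-10 * math.log10(p))

def encode_phred33(q):
    """Phred+33: quality Q is stored as the character with ASCII code Q + 33."""
    return chr(q + 33)

def decode_phred33(c):
    """Inverse of the above: stored character -> integer quality."""
    return ord(c) - 33

def trim_3prime(seq, qual_string, min_q=20):
    """Trim low-quality bases off the 3' end of a read, where Illumina
    quality typically degrades; stops at the first good base."""
    end = len(seq)
    while end > 0 and decode_phred33(qual_string[end - 1]) < min_q:
        end -= 1
    return seq[:end], qual_string[:end]
```

For example, an error probability of 0.001 maps to Q = 30, and Q = 40, the typical top of the Illumina range, is encoded as capital I, which matches the range described above.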
In the past, if you're dealing with older Illumina data, they used Phred+64, which means the range of characters you see would be something like this. So that you know what those things are about: if you look at the two ranges, they have many of the uppercase letters in common, but Phred+33 includes the digit characters over here and Phred+64 doesn't. So if you just want to look at some data and decide which one you have: if the quality strings contain digits, that's Phred+33. And again, the aligner should know which Phred encoding you're using, but these days it will hopefully detect it. Now, how do you store paired-end and mate-pair reads in FASTQ? Usually they're stored in separate files: you have a file with read one and a file with read two. Usually the read names end with /1 and /2, but that's not always the case. And sometimes you have the paired reads interleaved in the same file, so the first two reads are a pair, then the next two, and so on. Again, the aligners have options for this; you have to tell them how to read the pairs, and most of them accept two separate FASTQ files. Okay, reference alignment. The goal: why do we align to the reference? Usually we're trying to infer variations in a donor genome that we're interested in, and we need to find reads that come from certain genomic regions of interest. In rarer cases, reference alignment can be used to actually reconstruct the donor genome, as a first step in some longer process. The issues with alignment are that genomes are large and repetitive, and there are lots of errors to deal with; we'll get to those. [Answering a question:] yes, the read names are the same in the two files, and the reads should be in the same order in the two files.
They shouldn't be mixed up: it should be read one of pair one against read two of pair one, and so on, at matching positions in the two files. [Another question, about the clusters.] Yes, during amplification, the molecules in the same cluster should all be the same. Okay, so the issues we have to deal with in reference alignment are all these differences. There are differences between the donor genome and the reference, which are single nucleotide polymorphisms, small indels, and larger structural variations of all sorts, like inversions and so on. And there are differences between the donor genome and the donor's reads, namely all those variations plus sequencing errors. An aligner has to deal with all these differences to find the location where to put each read. Okay, the main steps, in general, without getting into details. All the aligners construct some kind of index of the reference genome at the beginning, which can be reused for different runs: we preprocess the reference into an index once and reuse it. Then, for each donor read, they identify genomic regions where that read might align, and that is done in all sorts of ways, for example using hash tables. Pairing information, if it is available, is used to reduce the list of candidate locations where you might place the second read. Then a more thorough alignment is done in those candidate regions; this step is costly, and the reason you don't want to run it everywhere is that it would be too slow. That's why we have the candidate-identification step first. Finally, various aligners have various stopping criteria: you have all these candidate locations where the read aligns, some better than others, and at some point the aligner has to stop and say, okay, this location, or these few locations, are the best for this read.
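The steps above, index the reference once, seed candidate locations with a hash table, then do the thorough (costly) check only at those candidates, can be sketched as follows. This is a toy version; real aligners use far more sophisticated indexes (e.g. the FM-index in BWA) and full alignment algorithms rather than mismatch counting:

```python
from collections import defaultdict

def build_index(reference, k=4):
    """Index every k-mer of the reference in a hash table.

    This is the 'build once, reuse for many reads' preprocessing step.
    """
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def align(read, reference, index, k=4, max_mismatches=2):
    """Seed with the read's first k-mer, then do the thorough check only
    at the candidate positions; return (best_pos, mismatches) or None."""
    best = None
    for pos in index.get(read[:k], []):          # cheap candidate lookup
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue
        mism = sum(a != b for a, b in zip(read, window))  # costly check
        if mism <= max_mismatches and (best is None or mism < best[1]):
            best = (pos, mism)
    return best
```

Even in this toy, you can see the trade-off: the hash lookup is cheap, and the expensive base-by-base comparison runs only at the few candidate positions instead of over the whole genome.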
And again, the aligners have different features. Some of them report secondary alignments, so they report that this read maps here, and here, and here; others give you just the one location which is the best alignment. And some of them do spliced alignments, which are relevant for RNA sequencing, where we're trying to map reads across splice junctions. So, given a read, this is a schematic of what I was saying. This is the read here, this is the reference. The aligner might identify three candidate locations which look like this: this one has one mismatch, this one has two mismatches, but this one aligns perfectly. In this case the aligner will probably say this is the mapping location: this read comes from this position in the reference, because this is the best position. Again, during all this, you have to keep in mind what we're trying to do: we have reads from a donor genome which we are mapping to a reference genome, and sometimes the reference will not contain all the sequence present in the donor, so that might be a problem; we'll get to that. How do you map paired reads? The aligner looks at the pair: say it looks at the first read and finds two possible locations in the reference; then it looks at the second read in that pair and finds that over here it's a really bad match, with too many mismatches, but over there the paired read maps perfectly. In that case the aligner will probably decide that this is the location for the pair. There are also two properties of aligners to consider. One is accuracy: misaligned reads cause false positives later down the line, in the processing pipeline. The other is sensitivity: the aligner has to allow for all sorts of variation. For humans there is a certain amount of variation you expect between individuals; for other species it might be more or less.
Speed matters too: there are large amounts of data to deal with, and memory usage. And the question always comes up: which is the best aligner? I don't know. I have experience with the ones which are commonly used; some are faster, some are more accurate, and if you need special functionality, like spliced alignment, there are special aligners for that. I found a survey table of aligners from a couple of years ago and was going to put it here, but the page had been updated and the link didn't work anymore. In our lab we're just going to be using BWA, which I've also had the opportunity to work with.

SAM/BAM format. I'm going to mostly skip this because we will see it in the lab, but this is what an alignment record looks like. You have the read name; certain flags, which we'll see in the lab; the chromosome where this read is mapped, which is 19 in this case; the position within that chromosome; the mapping quality; and a CIGAR string which describes how the read aligns at that location. These other fields are information about where the other read in the pair is mapped: the equals sign means it's on the same chromosome, 19; this is the position where the other read maps; and this is the inferred size of the fragment. Then there is the sequence of the read, and then the base qualities. The one thing I want to mention here before we go on is this mapping quality field. Mapping quality is again a Phred encoding of a probability, the probability that the alignment location is wrong, using the same formula. So a mapping quality of 30 means that one in a thousand alignments with that quality will be wrong, okay?

What I want to emphasize is that mapping quality and base quality are different things. Base qualities refer to how the sequencer read each particular base, whether the light signal was clear or not. Mapping quality has to do with all the other locations in the genome where the read could be placed. So a read that comes from a repetitive region of the genome, one that occurs many times, might have great base qualities, say all 40s, because the sequencing process went very nicely, but the mapping quality will be zero, because the aligner looks at this read and finds a hundred or a thousand positions where it might place it. For that reason, what the aligner usually does is pick one location and report it with a mapping quality of zero, saying: well, this read maps here, but the probability that this mapping is wrong is really high. Conversely, you can have the other extreme: a read with poor base qualities, but the sequence is so unique that the aligner figures out the exact location in the genome it comes from, and it says: it comes from here, and even though some bases are wrong, it still comes from here, because there is no other location in the genome anywhere close to this one.

The last thing I want to talk about is sources of errors in the mapping process; I'll talk about three of them. One is duplicates. Say you do reference alignment and then open a visualization tool, which Mark will be talking about, since the lab after the next one will be on visualization. You look at the reads and they look like this: these are individual reads, the colors are the directions of the reads, some on the positive and some on the negative strand, and this is the coverage track. You can see the coverage is roughly constant, then all of a sudden there's this really big bump, and then it continues. You look at the individual reads and this is what you see. So what happened here? There is a chance that, during library preparation, you sheared the DNA and just happened to get 13 or 14 DNA molecules which are exactly the same and come from that location, which would be fine. But more likely, these things were over-amplified during PCR: you had one copy of one fragment, and the PCR amplification step is not uniform, so some molecules get amplified much more than others. In this case, even though this region occurred in the donor genome at about the same copy number as its surroundings, when you map the reads back it looks as if there are more copies of this region. So this is what duplicates are. On the right is a more normal picture: you still have some duplicates, maybe some molecules which start at the same position, but there's no such abrupt increase, so this might be okay; it might indeed be a region that occurs several times in the donor. But on the left it's probably not. So there are PCR duplicates, created during PCR amplification, and there are also what are called optical duplicates, which come from the machine reading the same cluster twice: when the picture of the slide is taken, an individual cluster shines one light, red or blue, and the image processing done inside the sequencer might split that light into two spots, so it can produce two spots even though in reality there was just one cluster. These can lead to problems, and there is software to mark duplicates, which is a program we'll run in the lab. [Question about optical duplicates.] Yes, there is information about the physical location these reads come from in the read label, and the software uses that to infer that they are probable optical duplicates. Are they common? You have to be aware of them; that's why it's a standard step to run duplicate marking even if you think you don't have PCR duplicates.

The second thing I wanted to show is indel realignment. Say you look at some alignments and they look like this: there's a reference sequence here, and these are reads, with arrows for direction. If you look at them, you might notice that all the reads which end on this side show a variation, a SNP or two, over here towards their ends, and all the reads which end on the other side show a different two or three SNPs over here. If you see a picture like that in a visualizer, it might illustrate a problem with the alignment. If you run a piece of software called an indel realigner, it can look at all these alignments together and actually realign the reads. What happens here is this: we said at the beginning that Illumina data is characterized by substitution errors. For that reason, aligners will not easily introduce indels: when they see a read, they will try to place its individual bases against the genome as substitutions, unless they have good evidence for an indel. The problem is that the aligner only looks at one read, or one read pair, at a time, so it always has to make this call with the evidence from a single read. For that reason, indel realignment is also a standard step in this processing: it is a process which looks at all the reads mapped to a certain region together. Once the indel realigner sees this kind of picture, it can aggregate information across reads and say: well, all this mess, all these substitutions here, are better explained by a one-base insertion over here, or a deletion over there. That's what indel realignment is all about. This picture over here is the same kind of thing; ignore the top part. In the bottom panel you see a read mapped with a big deletion, and apart from that, these read ends are exactly the same sequence: these are read ends which are being misplaced because the aligner, looking at one read at a time, is not sure it should put a large deletion inside the read. An indel realigner might be able to aggregate this information and remap the reads correctly. [Question.] Yes, the aligners can only look at one read or one read pair at a time; that's why they don't do this themselves, it's faster that way, and all this work happens afterwards, as a different step.

The last thing I want to mention before the lab is about novel sequence. Going back to what was pointed out: the reference doesn't always contain the same regions that the donor has. Here we are on a chromosome, next to a centromere, and you can see this very wild variation: we have coverage at a certain level, then you have these huge peaks, and then it goes down again. What's happening here is that there are differences in the copy numbers of those repetitive regions between the donor and the reference, and the aligner will just try to place all the reads somewhere in the reference, and for that reason it can create garbage alignments in these regions. Another thing: I posted a link on the wiki as an additional resource. A few years ago, in 2011, the 1000 Genomes Project decided that there was too much contamination in some human samples, so they created a reference which contains decoys. What that means is that there is the normal human reference, and in addition to that they added all sorts of viruses and other sources of contamination which are common in sequencing experiments, and those appear as extra chromosome-like sequences in that reference. The moment your sequencing picks up that kind of contamination, the aligner will map those reads away from the real chromosomes onto the decoys, so that they are not forced onto the reference, where they would create false-positive variant calls. Okay, so that's what I wanted to say. The sources of errors are: duplicates, and we have software to deal with that; misplaced indels, which we can deal with via indel realignment; and for novel sequence there isn't much to do: if you really have novel sequence in the donor, which is not contamination, then you cannot get a good alignment, because the reference just doesn't contain it. [Question about the decoys.] Yes, we're going to see the contamination in the lab. What kind of contamination? Things like viruses and other sequences that commonly show up in sequencing experiments.
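To wrap up the duplicates discussion with something concrete, here is a minimal sketch of position-based duplicate marking. Real duplicate-marking tools also use the mate's position and keep the copy with the best base qualities, and the field names here are made up for illustration:

```python
def mark_duplicates(alignments):
    """Flag reads sharing a (chrom, pos, strand) signature as duplicates.

    The first read seen with each signature is kept (duplicate=False);
    every later read with the same signature is flagged. Each alignment
    is a plain dict here, standing in for a SAM/BAM record.
    """
    seen = set()
    for aln in alignments:
        sig = (aln["chrom"], aln["pos"], aln["strand"])
        aln["duplicate"] = sig in seen
        seen.add(sig)
    return alignments
```

The key idea matches the lecture: two reads starting at the exact same position on the same strand are much more likely to be PCR or optical copies of one molecule than two independent fragments sheared at the same base.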