Hello, my name is Matei David. I'm a postdoc here at OICR, working with Jared Simpson. I've done some prior work on read mappers: I wrote a read mapper called SHRiMP for SOLiD data, which is now mostly retired, and these days I work on de novo assembly, combining long-read technologies with regular Illumina data. So that's me.

Okay, so in this module we're going to talk about reference genome alignment. The objectives: I'll try to explain how reference genome alignment works. I'll spend some time on Illumina sequencing technology, in particular on the types of errors you will see in this data, how those errors affect mapping, and the results you get out of this analysis. We'll learn some alignment terminology, look at the common file formats, FASTQ, SAM, and BAM, in the lab, and actually run an aligner. In fact, the lab has two aligners set up, so you can run both of them. The module is assigned two hours; I'm hoping to finish the lecture in 45 minutes and then we'll just do the lab.

This is the outline of what we're going to do, so let's start at the top with Illumina sequencing. There are lots of sequencing platforms out there. Some address high throughput, producing many gigabases per run; others try to produce longer and longer reads. This slide is from 2014, so it's a bit out of date; in particular, Oxford Nanopore is a newer technology that doesn't appear here. In this module we'll focus on Illumina data, which is one specific type of data, so I won't be talking about others such as Pacific Biosciences. Okay, so let's talk about Illumina sequence processing first.
One cautionary note first: my background is in computer science and mathematics, so I don't really know biology. I'm guessing there are people in this room who know more biology than I do; I just learned whatever I needed as I went along.

Okay, so, Illumina sequencing. Starting with the genomic DNA at the top, how does it work? First, the genomic DNA is sheared into small pieces, roughly 200 to 600 bp fragments. The ends of these pieces are repaired and adapter sequences are attached to them. These fragments are applied to a flow cell, where every single piece is amplified into a cluster of molecules, which are hopefully identical, using solid-phase PCR. The flow cell is then fed into the machine, which goes on to read the molecules in the clusters. In Illumina, this is done using imaging, basically.

In more detail, consider this molecule being sequenced. In every cycle, one base is attached to the growing strand, complementary to the corresponding base in the template molecule, here the ACGA. The bases being attached carry a fluorescent dye; a laser is shone on the flow cell, and all the clusters produce small dots of color, which is what the machine sees. So in the first cycle the machine sees a blue light corresponding to the first template base, in the second cycle it sees the green corresponding to a G, and so on, and the process repeats.
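The cycle-by-cycle read-out just described can be sketched in a few lines of code. This is only a toy model: the base-to-dye color mapping below is made up for illustration (real instruments use their own chemistry-specific dye sets), and real base calling works on images of clusters, not on a known template.

```python
# Toy sketch of sequencing-by-synthesis read-out.
# NOTE: the base-to-dye mapping below is hypothetical, for illustration only.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
DYE = {"A": "red", "C": "blue", "G": "green", "T": "yellow"}

def simulate_cycles(template):
    """One cycle per template base: incorporate the complementary
    nucleotide and record the color of its fluorescent dye."""
    observed = []
    for base in template:
        incorporated = COMPLEMENT[base]  # base added to the growing strand
        observed.append((incorporated, DYE[incorporated]))
    return observed

# Reading the template ACGA, cycle by cycle:
for cycle, (base, color) in enumerate(simulate_cycles("ACGA"), start=1):
    print(f"cycle {cycle}: saw {color} light -> called {base}")
```

In the real process, of course, the machine only sees the colors and must infer the bases; the template is exactly what it is trying to recover.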
This is the picture: the actual molecule is attached to the flow cell at the bottom, and the sequencing starts at the top and proceeds downwards. One thing to point out is that the process repeats for about 100 to 300 cycles, so one of these Illumina reads will contain maybe 100 or 200 bases going down this molecule; the molecule itself could be a bit longer, 500 or 600 bp, so you won't necessarily read the entire molecule in one go.

The process of base calling is done by the machine, and it is massively parallel: the cycles apply to all these clusters at once. The machine sees this kind of picture, colored dots, and out of those dots it tries to make sense, saying, okay, in this cluster, at this cycle, there was a light that looked kind of green, so we'll call the base G there. That's the process by which the machine translates the images into base calls.

Why am I telling you this? Mainly in order to describe the types of errors you will see with Illumina data in particular. In figure A you have a cluster: identical, PCR-amplified molecules, and the sequencing is going just fine. All three molecules are at the same stage, all reading the fourth base from the top, and all the attached bases shine the same red light, so the sequencer sees a solid red signal coming out of this cluster and is able to call the corresponding base with certainty.

Then there are the various types of errors. In panel B you have phasing errors: even though the molecules are identical, the sequencing process has drifted out of phase. So here, on this molecule,
we're reading the fourth base, the red one, but on the molecule to the left the cycle chemistry skipped a base for some reason, so the light shining from it is the blue corresponding to the next base; and on another molecule the cycle stayed behind, so it shines green. Because of this, the single light emitted from the cluster (each cluster contributes one of those dots) will be ambiguous. The sequencer will make some kind of guess about which base is there, and in this case it will probably say, well, I'm not so sure, maybe this is a G, but I'm going to assign it a high probability of being wrong, just because the signal is not clear.

Another type of error is loss of signal: maybe a molecule disintegrated, or something else happened, so the intensity coming out of that cluster is not as high as the sequencer would expect, which again might confuse the sequencer into an error. And finally, even if the molecules are all in cycle, there can be errors with the fluorophores themselves, so the emitted light might still be misread by the sequencer.

[Question: how do these errors come about?] It's essentially random. The flow cell contains many of these clusters; this is just one cluster, corresponding to one of the dots on that slide. The machine is reading many, many of them at the same time; it takes pictures, and then software goes through saying, okay, this dot is red at this cycle, then blue, and hopefully it works for most of them. But these are the types of errors you're going to see; you cannot control them, they occur just by chance.

Okay, so part of the point here is
that Illumina sequencing produces mainly substitution errors, as opposed to the long insertion or deletion errors you get with other technologies. And the base-calling process produces not only the base being read but also an error estimate for that call, which has to do with the intensity and with how well the sequencer could differentiate between the colors.

This is what the error rates look like for some Illumina sequencing projects: error rate on the vertical axis, base position within the read at the bottom. The various lines are different organisms; this slide is taken from an assembly paper, and I think the light blue is human data. The message is that the error rate increases as you progress along the read: this is a hundred bases into the read, a hundred and fifty over here. At the beginning the process is quite accurate, but, as you might expect, you need all those molecules to go through the cycles in sync, and that doesn't always happen, so the error rate grows larger and larger towards the end of the read. In general it's still quite good: this line is 0.5 percent, this one is 1 percent, and the per-base error rate stays well below that. Overall, for Illumina reads you expect an error rate of about 0.5 percent, mainly substitutions. This is in stark contrast with other technologies such as Pacific Biosciences and Oxford Nanopore, which have higher error rates and other error types, not just substitutions but insertions, deletions, and so on.

Okay, so that's Illumina sequencing. Another thing I want to talk about is paired reads. What are those? This is the normal process performed by Illumina
these days. The genomic DNA is sheared into fragments; adapters, which are different for the two ends (say A1 and A2), and sequencing primers are ligated; the fragments are amplified on the flow cell; and then every one of those molecules, which again are up to, say, 500 bp, is read twice. First the molecule is read from one end, going towards the flow cell; then, in another step, the molecule is reversed, the cluster is regenerated, and so on, and the same molecule is read a second time from the other direction. The machine keeps track of all this, and what you see at the end are paired reads: out of the machine you get two reads that sequence the same genomic fragment, pointing towards each other. One is from one end, the other from the other end, and if the fragment is longer than twice the number of cycles, you will not read the entire fragment: there will be an unsequenced stretch in the middle. This is what's called paired-end sequencing.

Then there's another protocol called mate-pair. Why would you want that? Mate-pair reads are used more for, say, assembly of a new organism, where you don't have a reference to align against. It's hard to do assembly with fragments of only up to 500 bp, basically because the assembler is confused by any repeat longer than that. Mate pairs are one way to apply Illumina sequencing and obtain sequencing data that spans further apart than just five or six hundred bp.

The way it works: again you have the genomic DNA, and this time it is fragmented into larger chunks, now two
to five kilobases, sometimes even larger, like 40 kilobases. A biotin molecule is attached to one end of each fragment, and the fragments are then circularized, so they look like this. These circular molecules are then sheared again into 400 to 600 bp fragments, which look like this, and at this point the molecules that do not contain the biotin are washed away, so you keep only the fragments carrying the biotin marker in the middle. Those fragments are then fed into the regular sequencing process at this particular step.

So what you end up with: for the original paired-end reads, you have a genomic region of up to 600 bp being sequenced with the two arrows pointing inwards, towards each other; that's the regular setup. For mate-pair reads, the arrows point the other way, outwards, away from each other (remember, the junction comes from the biotin-mediated circularization), and they are much further apart than in the paired-end case, two to five kilobases. So that's the difference between paired-end and mate pairs; if you ever encounter these terms, now you know what they are.

Okay, let's talk about the FASTQ format. This is the format in which regular Illumina reads are stored, and the entries in one of these files look like this.

[Question: how does the number of cycles relate to the fragment size? In paired-end, the number of cycles is lower than the size of the fragment.] Right, so in paired-end, on the left side: if this molecule is 600 bp long and the number of cycles is only 150, you're
going to read 150 bases from one end and 150 from the other, which is 300, and in the middle you have a region of about 300 bp that is not read. Exactly the same thing applies to mate pairs: for this particular fragment, the one with the biotin molecule in the middle, the same statement holds. It's just that, because of the circularization, you have to unwind it, place the two reads on the reference, and see how much genome lies between them; you end up with reads that again have 150 bases on either side, but separated by up to 40 kilobases in the middle.

Okay, the FASTQ format. In FASTQ files, every read is stored on four lines; we'll look at these in the lab. The first line is the read name, a label that sometimes encodes the position on the flow cell where that read came from (remember the square picture with the dots). The second line is the read sequence, over ACGT. The third is a separator line, and the fourth holds the base quality values for that particular read, which is the way those error estimates I was talking about are encoded. Within a bigger FASTQ file, you can see this is one read, this is the record for another read, then another; the records are sequential within the file.

[Question: what is the third line for?] The third line is the one with the plus. Initially, a few years ago, I thought it was a comment, but I think that line is a bit redundant; I believe it comes from earlier file formats where the read bases and the base qualities were stored in different files, each with its own name, a label for the read and a label for the base qualities, and in the
process of putting those together this third line appeared. Usually it's either empty (just the plus) or equal to the label on the first line, and it's probably safe to say that many tools simply ignore the third line, but let's not assume that. In this particular example, you can see the label on the third line is actually equal to the label on the first.

Okay, so, base quality scores. Out of the sequencing process you have error estimates for each base call: this call is a C, but with, say, 15 percent probability it's an error. The way this is encoded: first, the error probability is transformed into an integer within this range. I keep forgetting to add the formula to my slides, but it's quite simple: the quality is Q = -10 * log10(p), where p is the probability of error. You apply this standard formula to the error probability and you get an integer quality value in this range. These integers are then encoded as ASCII characters within the file, using one of two schemes: Phred+33 and Phred+64. Why two? So that you become confused, that's why; no, really, it's because the different companies used different encoding schemes. The situation is better these days, though.

This table, from the FASTQ Wikipedia page, shows which quality each character encodes. You have the ASCII characters from number 33 to 126, and under the Phred+33 encoding scheme these characters here encode the qualities between 0 and 40. So, for instance, if in the quality string you see digit characters like 0 through 9, which do not belong to the Phred+64 range, you know your data is encoded as Phred+33. If you ever see
lowercase letters like a, b, c, d, e, f, that means the encoding scheme for those qualities is Phred+64, because, again, the Phred+33 range ends over here and never reaches those characters. These days this is less of a problem, because Illumina, which produces the bulk of NGS data, switched to Phred+33, so recent Illumina data is encoded as Phred+33 just like all the other sequencing data. But if you ever look at older Illumina data, or download it from some public repository, the qualities might be encoded as Phred+64.

How are paired-end and mate-pair reads stored? Usually the corresponding reads are given to you in two separate FASTQ files that are kept in sync: the first record in one file is one read of a fragment, and the first record in the other file is the other read of the same fragment, and they continue in sync like that. More rarely, the reads from the same fragment are interleaved in a single file, but that's not usual. Often, but not always, the read names end in /1 and /2: you open one file and all the read names end in /1, and in the other file they all end in /2. But again, that's not always the case.

Okay, let's talk about reference alignment; it's all about Illumina here. What's the goal? In general, we want to find the reads that come from certain genomic regions of interest, such as genes: we have a reference genome, we know where the genes are on that reference, and we want to find the reads that come from a particular gene in the donor genome. The goal, again, is to infer variation in the donor genome. Sometimes we also want to reconstruct the donor genome, that is, to do assembly, and some assemblers do start by mapping the reads to the reference and processing them from there. What are the main issues with
this? Well, first of all, eukaryotic genomes are large and repetitive, and mapping to repetitive regions is very error-prone. There are large amounts of data. And then there are all the differences between the reference and the donor you're sequencing: SNPs, indels, structural variation; and between the reference and the actual reads you see, there are all those variations plus the sequencing errors on top. So the alignment process has to allow for all these differences.

The main steps (I'm talking here about the generic alignment process, not about any particular aligner): first, the aligner constructs some kind of index of the reference sequence. It takes the reference, processes it once, and creates some structure that it uses internally. Then, for each sequencing read you give it, it identifies candidate regions for alignment. This is the most important step, and aligners differ in how it's done: some ask for exact matches of a certain minimum length, or for several matches in close proximity, again of certain lengths. If pairing information is available, that is, paired-end or mate-pair reads, the smarter aligners use it to reduce the list of possible alignment locations. The third conceptual step is to actually align the read to all the regions where it might go, and this is a very costly step; the speed comes from reducing the candidate list using all sorts of heuristics. Finally, aligners have varying stopping criteria: some give you all the alignments for a read, some give you just the best alignment; there are all these options. Some aligners also produce secondary or split alignments: they may split the read, so for a long read, say 150 bp, the aligner can report a mapping of 70 bp
to one location and another 80 bp to a completely different location, that kind of thing. These options differ from aligner to aligner.

This is just a diagram of what I said: here is a read and here is the reference, and the aligner finds three candidate locations, one, two, three. Here the read aligns with one mismatch, here it aligns with two mismatches, so in this case the aligner will probably report that the correct mapping location is the first one, and it will attach a certainty to this assignment, a quality value: it will say the mapping is correct with a certain probability. For paired reads: looking at the first read in the pair, say it matched two locations in the genome, each with one mismatch; but looking at the second read of the same fragment, it maps with no mismatches near one location and with lots of mismatches near the other. In this case the read aligner will probably say, okay, the one on the left is the correct mapping location for this read pair.

The properties relevant to aligners are: accuracy, because misaligning reads to the reference is a source of false-positive variant calls further down the pipeline; sensitivity, because you must allow for variation between reference and donor. For humans, the expected difference between the reference and the donor is quite small, but for other organisms it can be much larger: I was once working with some Ciona data where the organism had about seven percent difference between two individuals, which is huge. The more the reference and the donor differ, the more sensitive you want the aligner to be. Speed is always a factor, because there are large amounts of data to process, and so is memory, because compute nodes have limited resources. So the first question is: which one
is the best aligner? I don't know. The ones we'll use in this lab are BWA and Bowtie, which are good, well-established, all-around aligners, but there are better choices if you're looking for certain things: some are faster, some are more accurate, and if you need special functionality, say aligning RNA-seq data across splice junctions, there are separate aligners that offer those kinds of options. You can consult a recent survey for comparisons; I was going to include one, but for some reason the page the paper pointed to is no longer available.

Okay, are we doing on time? The SAM/BAM format. Once you run an aligner, it produces this kind of description of which read maps where in the reference genome, and the way this is described is in the standardized SAM format, a tab-delimited text file. BAM is just a compressed binary form of the same thing. The way to access both types of files is the SAMtools toolkit, which we'll use in the lab; it can be used to convert, sort, and extract alignments from a given location, and so on.

This entire thing over here is part of a SAM file, just one line describing one mapping; it's shown on five lines only because I ran out of space. The main fields: this is the read ID over here; this is the flag, a numerical value that encodes things such as which strand of the reference the read is mapped to (forward or reverse), the pairing information for that particular read, and some properties of the alignment, whether it is primary or supplementary or a duplicate, and so on. These are all encoded in this one integer value, and we'll look at it in the lab. The next two fields: this is the chromosome, so in this particular case the
read is mapped to chromosome 19 of whatever reference was used, and this is the position within that chromosome. Then there's the mapping quality; again, this is a Phred encoding of the probability that the alignment location is wrong, and it is one of the important fields. Phred is just that same formula, an encoding of a probability of something being wrong as an integer, so to recover the probability you invert it: p = 10^(-Q/10). In this case, if I have a mapping quality of 60 and apply that formula, the probability of error comes out to about 10 to the minus 6. The higher the number, the more certain the mapping, and the way to interpret this particular value is that out of one million mappings that look exactly like this, you expect one of them to be wrong. That's what 60 means. Usually, if a read is mapped to a repetitive region, or there is some kind of artifact, the aligner will report a very low mapping quality, so usually the first step in an analysis is to drop all mappings with mapping quality below a certain threshold, like 10 or 30. Again, a threshold of 30 means that one in a thousand of the mappings you keep might be wrong.

[Question: what if a read maps perfectly to two locations in the genome?] So the question is, you have the same read mapped perfectly to two genome locations; what will the aligner report? If you're asking for the single best mapping location, which is usually the
case (sometimes you want to list all of the alignments, especially for some projects), the mapper will produce either of those locations and assign it a mapping quality of zero. Technically, if you think about it, the probability that the reported mapping is wrong is exactly one half in that case, assuming the read comes from one of those two locations and nowhere else. If you apply the formula to a probability of error of one half, you get about three, so technically the mapper should probably report a quality of 3, but in ambiguous cases like this the mapper usually just trims that down to zero: it is completely uncertain about the alignment record it reports.

Okay, next is the CIGAR string. Remember, we have the read and the reference; the CIGAR string describes the differences between the reference and the donor in this particular mapping. In this example over here we have 76M, 76 matches: the read sequence is aligned to 76 consecutive bases of the reference. Unfortunately, this M does not tell you whether each base is equal to or different from the reference base it's aligned to; M can stand for both bases being equal and bases being different. So you cannot just look at this and conclude the bases are all equal.

Here are some other example CIGAR strings, with the reference on top and the donor below. Here you have four matches, then a one-base deletion (the CIGAR is with respect to the reference, so this T over here is deleted in the donor), and then you
have another six matches. In this other case you have one base which is inserted, this T, and you can also see that the first four aligned bases are not all equal: the first four bases of the read are mapped to the first four bases of the reference, but there is a difference here between an A and a T, and that is not captured by the CIGAR string.

The next fields in this entry are the mapping position of the mate: this equals sign means the mate is mapped to the same chromosome, this is the position of the mate's mapping, and this is the fragment size. Remember, that's the fragment we saw during sequencing, so this can be larger than the sum of the read lengths; it could be up to 600, depending on how the library was prepared. And then this is the read sequence, and these, again, are the base call qualities. We'll look at some of these files in the lab.

The last thing I want to talk about in the lecture are some sources of errors in this whole process. First of all, the reads you get from the machine might be wrong in all sorts of ways. You might have base qualities that are lower than what you would like or expect, and those qualities could be low at certain positions in the read; maybe there is some contamination on the flow cell, some substance interfering with the sequencing process, and you get overall low base qualities. You can have contamination either from other organisms or with adapter sequences. These days the adapter sequences are usually removed from the data you see, but sometimes they're not, or, as in the past, if you download an older data set it might have contamination with
the adapter sequences used by the Illumina process. Because of all this, the first thing we're going to do in the lab is inspect the input reads with software called FastQC, which produces various types of reports in which these kinds of problems can be spotted; we'll look at those in the lab.

So that's the input data; now what about mapping errors? Look at these two situations, left and right. In both cases this is the coverage, the number of reads mapped to each base of the reference, and in both you see a bump like this. Now the question is, what do you make of such a bump? Could it mean there is, say, a duplication, a region that occurs twice in the genome being sequenced, so it produces more reads than the region around it? Can that be the case in both left and right?

Here is the answer. The problem you might see with Illumina sequencing, exemplified on the left, is that of sequencing duplicates. During PCR amplification, the various molecules attached to the flow cell are amplified at different rates: some molecules are favored by PCR, so you get more copies of them than of others. In this case, after mapping, all these reads map to exactly the same location in the reference, which is a big warning sign that all of these molecules might come from the same fragment before PCR amplification. That in turn means the bump in coverage could be entirely artificial: you might be seeing it just because the fragment at that
(Question from the audience: aren't there some recent Illumina technologies that don't have the PCR step anymore, that just sequence directly?) Yes, that depends on the library preparation: some protocols do not have a PCR step, others do. I'm just telling you that if you have PCR steps, you might see this kind of thing. Also, from what I understand, there are two types of duplicates. One type comes from PCR, through the process I just described. But you can still get duplicates even without PCR, because of the way the camera looks at the image of dots: it might just read the same dot twice, so you get duplicates that come from the same region on the flow cell. Those are called optical duplicates, as opposed to PCR duplicates, and you can have them all the time, whether there is a PCR step or not.

The way to deal with duplicates is to run some kind of software, in our case Picard, which finds and marks duplicates so that you can remove them and get a better estimate of the actual copy number in your downstream analysis.

Next, indel realignment. Let's look at this picture; forget the top half and look at the bottom half. Remember I said that the primary source of errors in Illumina data is substitution errors, as opposed to indels. That fact is used by the aligners when they decide how the bases in the read are mapped to the reference: in their scoring, substitutions are treated as much more likely than insertions or deletions. So let's see what happens here. In this bottom part you can see there are about nine reads whose right
ends all fall in this same region, and they are all clipped. This is a visualization tool you will learn about in a different module, but basically the colors are saying that those bases are not mapped to that particular location. What's weird is that all the reads ending in this region have the same bases, the same colors, aligned the same way, and the same thing happens for the reads whose left end is in this region: here you have one, and then another three. (Let's forget about this one read; maybe that's erroneous.) What's happening is that the aligner takes each of those reads, one at a time, tries to map it to the reference, finds these differences, and because of the scores it is using it says: well, this is a substitution error, and this is another substitution error, and so on. One thing to keep in mind is that aligners always process one read at a time, perhaps together with its mate, but they never look at two different reads jointly. The aligner takes one read, processes it, maps it, takes the next read, and so on, which is why it never sees the bigger picture that you see here. So what actually happened in this case? You can see this read right in the middle here: it is mapped over here, then there is a big deletion, and then the read continues over here. What happened is that the donor really has a large deletion with respect to the reference, but because of the scoring scheme and the underlying assumptions about Illumina data, the aligner prefers to map all these other reads with substitutions rather than deducing that there is a large deletion. Again, this type of situation can lead to variant calling errors downstream.
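To see why the aligner makes this choice, here is a back-of-the-envelope score comparison in Python. The penalty values are hypothetical, chosen only to be similar in spirit to common short-read aligner defaults:

```python
# Toy affine-gap scoring (hypothetical values: match +1, mismatch -4,
# gap open -6, gap extend -1 per deleted base).
MATCH, MISMATCH, GAP_OPEN, GAP_EXT = 1, -4, -6, -1

def score_substitutions(aligned_len, n_subs):
    """Score of an ungapped alignment with a few substitutions."""
    return (aligned_len - n_subs) * MATCH + n_subs * MISMATCH

def score_deletion(aligned_len, del_len):
    """Score of the same read aligned across one long deletion."""
    return aligned_len * MATCH + GAP_OPEN + del_len * GAP_EXT

# A 100 bp read whose last few bases disagree with the reference:
print(score_substitutions(100, 3))   # 97*1 - 3*4 = 85
print(score_deletion(100, 50))       # 100*1 - 6 - 50 = 44
```

Per individual read, a few substitutions score far better than opening a long deletion, so that is what the aligner reports; only a tool that looks at all the reads over the region together can recognize the deletion as the better joint explanation.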
That is one of the things we clean up right after mapping. Here is another example of exactly the same thing: you have reads whose left end is here with one error, and reads whose right end is over here with a different error, and if you realign them, all of this is resolved by calling a single deletion here. The software that does this, which we're going to use, is called GATK; it's a standard tool. It is able to look at all the reads mapped to a certain region of the reference and realign them locally, together. Again, this is something the aligner can never do, because the aligner only sees one read at a time.

And then the last thing is novel sequence, which is the hardest to deal with. When you have a reference genome and a donor, most of the donor reads will map somewhere in the reference, but there will always be regions in the donor which are simply not represented in the reference, either because of genuine differences or because the reference is wrong. In this image, this red bar signifies that the region we're looking at is right next to a centromere, in this case the centromere of chromosome 2, and you have this kind of coverage map: every line here is a read mapped to this region, and you have these bumps, and over here some really huge bumps. What's happening is that, most probably, the donor and the reference differ in the number of copies of some of these really repetitive regions next to the centromere. Maybe the donor has more copies, and if that's the case you will have more reads coming from
those regions; the aligner will then try to force those reads to map somewhere in the reference, but it doesn't have that many possibilities, so it ends up creating these huge bumps, with a lot of donor reads mapped to the same locations. If you see this kind of picture, it is very hard to do any kind of variant calling in the area; the only thing you can infer is that there is too much variation between the donor and the reference in this region to say anything smart about it. This picture only shows the situation with repeats near a centromere, but the donor might also contain completely novel sequence that is not represented in the reference at all, and it doesn't have to be repetitive. When you try to map it, those reads might end up mapped to completely wrong places, and that's something you really cannot do much about: you are mapping to a reference, and you have sequence which simply isn't there, so you don't know what to do with it. There's no magic wand for novel sequence; it's something you have to live with when mapping to a reference. So that's it. Any questions before we go into the lab?

(Question from the audience: how does this picture relate to gene duplication? For example, in our bacterial system, the same transposon sequence can insert into operons at different positions. If I want to identify whether the transposon ended up in only one operon or in more than one, and I see something like this, should I conclude there is more than one copy?) No. What I was trying to show with the left and right panels is that the left panel is much more suggestive of duplicate reads
than the right panel. You might have legitimate copy number variants between a donor and the reference: genes or whole regions duplicated, so that the donor has two copies where the reference has only one, and because of that you will see a bump in coverage. That's roughly what the right image is showing. The problem with the left image is that when you map these fragments, they are not just all mapped to the same region: the exact break points where they map are identical, you can see they are all aligned here. That suggests the fragments from the Illumina sequencing are really the same fragment. Again, it's an inference: it is of course possible that you take the genomic DNA, shear it, the donor has two copies, and the sonication happens to produce the exact same fragment with the exact same break points from both copies. That's also possible. But if you see the exact same break points, it is more likely that the reads come from the same fragment prior to PCR, and that's why we call them duplicates. So the difference is in the break points: on the right you probably still have some PCR duplication, there are several molecules starting at the same position, but overall the increase in depth is much more gradual than this steep jump on the left.
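The break-point argument can be turned into a toy Python heuristic. This is only for intuition, not what Picard actually computes:

```python
from collections import Counter

def max_breakpoint_fraction(starts):
    """Fraction of reads sharing the single most common 5' start position.
    Near 1.0 suggests PCR/optical duplicates (identical break points);
    a low value at high depth is more consistent with a real copy number
    difference, where read starts are staggered."""
    counts = Counter(starts)
    return max(counts.values()) / len(starts)

dup_bump = [5000] * 9 + [5003]                 # left panel: one shared break point
cnv_bump = [5000, 5003, 5007, 5010, 5014,      # right panel: staggered starts
            5018, 5021, 5025, 5029, 5032]
print(max_breakpoint_fraction(dup_bump))   # 0.9
print(max_breakpoint_fraction(cnv_bump))   # 0.1
```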