Okay, module five will also be about variant calling, but another type of variant calling: structural variant calling, which I have to say is my preference. It's harder, more complicated, and wider in terms of what it covers, but it's more challenging and I prefer it. In this module I will try to make you understand what structural variants are and how we can discover them using NGS data; the different strategies you can use, with their strengths and weaknesses; and a bit about how we can see these variants in IGV.

So what are structural variants? Genomic rearrangements that are usually more than 50 base pairs long (shorter events are called indels), and they can be deletions, insertions, inversions, mobile element insertions, duplications, or translocations. Just to give you some background: until recently, structural variants, and especially copy number variants, were detected by FISH and karyotyping, which was not very practical, plus a bit of array work, so it was neither an efficient nor a high-resolution way to detect structural variants. This is where next-generation sequencing has really been a game changer for this type of variant.

To give you an idea of the impact of structural variants: a lot of the people looking at structural variants work in cancer, because in cancer structural variants and copy number changes have a very, very strong impact. This is an example of a karyotype from a cancer patient: you can see that chromosomes are totally broken apart, reassembled, duplicated or deleted, and it's a big mess. So it's really interesting to study structural variants when you are working on cancer projects.

The classes of structural variants: we have the copy number variants, the CNVs, that is deletions and duplications of large segments of the genome; we have the copy-neutral rearrangements, inversions and translocations; and we have the other types of structural variants, insertions of new sequence (a virus, for example) and mobile element insertions. This is the same classification of structural variants, but shown as what they represent in terms of genomic structure. What is really important to understand when you study structural variants is that when we talk about structural variants, we talk about what we see in the sample compared to the reference. Here we see a deletion, but in fact there is no deletion in the sample: the sample DNA is just like that, and it is "deleted" only when you compare it to the reference, because the reference has something more. And when we have an insertion, the sample has something more than the reference. So keep that in mind: it always refers to the reference genome. So here you have insertion of genomic sequence or of a mobile element, tandem or interspersed duplication, inversion, or translocation.

As I said, NGS was a game changer for this kind of analysis. We did karyotyping, then FISH, then for CNVs we did CGH arrays, then SNP arrays, and now we are working with high-throughput sequencing. Just to give you an idea of the type of information we have when we detect CNVs: with a CGH array, you just look at the amount of DNA you have at each position. With a SNP array, you have the same information, but you also look at the allele frequency.
With NGS, a lot of the methods are similar to what was done with CGH arrays, because you look at the read distribution to estimate copy number, but we now also try to develop methods that mimic what we have with SNP arrays. With NGS data we can detect almost every type of variant: point mutations, as we did previously, but also indels, copy number changes, and the other types of structural variants.

So what are the strategies used for structural variant detection? There are four major strategies. The read-pair strategy (I will go into the details of each method after this) looks at how the two reads of a pair map relative to each other. Read depth looks at the accumulation of reads. Split read looks at how a breakpoint affects the internal sequence of a single read. So read pair is really about the two mates relative to each other, split read is inside the same read, and then you have a separate method, which is assembly.

Read pair: how does it work? The idea of read pair is that you identify the breakpoints of your structural variant by looking at the alignment of your two reads. When you made your sequencing library, you created it with a specific fragment size: you know you are around 400 base pairs, for example. And if a pair spans a variant, you expect your reads to map either too close together or too far apart compared to what you sequenced. So the idea is to look at this variation of the insert size to determine the structural variant. The main limitation of that is the sensitivity, and the sensitivity depends on how you did your size selection, that is, on the standard deviation of your insert size relative to the size of your reads. When you look at your insert sizes after mapping, you get a roughly Gaussian distribution around the mean insert size you selected. The wider your size selection, the wider this distribution will be, so it's really important to have a tight distribution. Then what you do is say: OK, I've got my distribution of insert sizes; I decide that, for example, two standard deviations around the mean is my concordant distribution, the blue distribution, and everything outside of that, the two tails, is my discordant reads. Then I take these discordant reads, look at where they are in the genome, and try to find clusters of discordant reads at the same location. That gives me several pieces of evidence for the same discordance, and then I call my structural variant.

How it works: if you have a concordant pair, you've got your sample genome here and your reference genome there, and you expect to see the same distance in the reference as in your initial fragment. Now, if I have a deletion, so if I lose part of the reference genome, my fragment is still the same size, because I selected my insert size; but when I map it, since the sample has lost a piece relative to the reference, my two reads map with a really large insert size. Same for an insertion: my fragment is the same size, but it contains sequence that is not in the reference, so the two reads map too close together. So it is really easy for insertions and deletions.
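To make the read-pair idea concrete, here is a minimal sketch of how you could estimate the insert-size distribution and flag discordant pairs, assuming a coordinate-sorted, indexed BAM file and the pysam library; the function name, thresholds, and sample size are illustrative choices, not part of DELLY or any other tool mentioned here.

```python
import statistics
import pysam  # assumed available

def find_discordant_pairs(bam_path, chrom, n_sd=2.0, sample_size=100_000):
    """Flag read pairs whose insert size falls outside mean +/- n_sd * SD."""
    bam = pysam.AlignmentFile(bam_path, "rb")

    # 1) Estimate the expected insert-size distribution from properly paired reads.
    sizes = []
    for read in bam.fetch(chrom):
        if read.is_proper_pair and read.is_read1 and read.template_length > 0:
            sizes.append(read.template_length)
        if len(sizes) >= sample_size:
            break
    mean, sd = statistics.mean(sizes), statistics.stdev(sizes)

    # 2) Collect discordant pairs: insert size in the tails, or mate on another chromosome.
    discordant = []
    for read in bam.fetch(chrom):
        if read.is_unmapped or read.mate_is_unmapped or not read.is_read1:
            continue
        if read.reference_name != read.next_reference_name:
            discordant.append((read.query_name, "mate on another chromosome"))
        elif abs(read.template_length) > mean + n_sd * sd:
            discordant.append((read.query_name, "insert too large (deletion-like)"))
        elif abs(read.template_length) < mean - n_sd * sd:
            discordant.append((read.query_name, "insert too small (insertion-like)"))
    return mean, sd, discordant
```

A real caller would then cluster these discordant pairs by position and orientation before calling an event, which is exactly the clustering step described above.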
Insertion is a bit more tricky, because due to the size of your fragments you usually have a really low resolution, and it's hard to detect insertions: here you are limited by the size of your fragment, and if the inserted sequence is too big you will not be able to map the pair onto the genome. It's better for looking at deletions; that's easier.

Now, if you have a tandem duplication, you will have two copies in your sample, and some fragments will span the junction between the two copies. For those fragments, the read pair will map with a larger insert size and in the opposite orientation. You will still have normal pairs there, because most fragments are normal, but some of them will just pop up and tell you that you have a tandem duplication. An inversion gives quite a similar pattern: you will have read pairs with a larger insert size, but only one read of the pair will be inverted. And you should expect to have one set of evidence at this breakpoint and another set between that one and this one: you will not have only one breakpoint, you will have the evidence for the two breakpoints of the inversion.

Now, if you have a large insertion coming from some other, distant locus, it is more complicated, because the signal will not be confined to the same region of the genome, so it will be hard to differentiate between a translocation and an insertion. The way we detect that is that we expect to see two sets of pairs that anchor to the same region and point to two other locations.

So this is how read pair works: really simple, but with the limitation of the size of the reads and the limitation of the standard deviation of the size selection. There are many tools to do it, like BreakDancer, one of the first tools that was released, and a lot of other tools. Today we will use DELLY to do it, which is a really great tool; you will understand why after that. Just to tell you: structural variants do exist in the population, but when we apply a read-pair method we find a lot of them. The read-pair methods are really efficient, but they also produce a lot of false positives, so you will need to do a lot of filtering of your data. Another limitation of read pair is when you have a complex region, like this one: you don't know what you have, you just have crazy evidence of many events, and you cannot really make sense of it.

So to summarize read pair: the weaknesses of the method are that it is really difficult to interpret in complex regions, which is most often the case in repetitive regions; it is difficult to characterize heavily rearranged regions; and you have a high rate of false positives. The strength is that, theoretically, you can detect all types of SVs.
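To illustrate the orientation signatures just described for read pairs (pairs in the wrong order for tandem duplications, same-strand pairs for inversions), here is a rough, hypothetical classifier for a single discordant pysam read; it assumes the standard forward-reverse Illumina orientation and both mates mapped, and it only sketches the logic, not the way BreakDancer or DELLY actually implement it.

```python
def classify_pair_orientation(read):
    """Rough event hint from one discordant pair, based on mapping orientation.

    Assumes forward-reverse (FR) is the expected paired-end orientation and
    that both mates are mapped (check read.mate_is_unmapped beforehand)."""
    if read.reference_name != read.next_reference_name:
        return "translocation or distant insertion"
    if read.is_reverse == read.mate_is_reverse:
        # ++ or -- pair: one mate is flipped relative to expectation.
        return "inversion-like"
    # Opposite strands, but is the leftmost read the reverse one (RF instead of FR)?
    leftmost_is_reverse = (read.reference_start < read.next_reference_start) == read.is_reverse
    if leftmost_is_reverse:
        return "tandem-duplication-like (everted pair)"
    return "normal orientation (only the insert size is informative)"
```

In practice you would apply this to the discordant pairs collected earlier and only keep locations supported by a cluster of pairs giving the same answer.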
The second approach we can use is split read. The rationale of split read is that you take reads and look at the reads whose alignment has been broken into pieces. Then you try to make sense of these reads: you look at reads that break into a few pieces, you see how the pieces map together, and you try to find clusters of evidence that show the same pattern, and then you can call the variant. One of the major advantages here is that you are looking directly at the breakpoint, so you are really precise. People who work with split reads really recommend using longer reads, because you have a better chance of mapping every piece of a split read: if the break falls near one end of the read and one of the parts is too short, you won't be able to map it, so you just cut your read and don't know where the other part maps. So the longer your reads, the better the efficiency of the split-read method.

How it works: you take your reads and align them. Most pairs align correctly, but for some of them one end maps correctly and the other is broken into two pieces, and you take those to call the variants. So what are the signatures you can observe with split reads? Here I also show the paired-end mapping signature, which is what we saw in the previous method. If you have a deletion, with read pairs you expect a larger insert size; with split reads you also expect a read that is one continuous sequence in your donor, but whose first part maps to the reference, then there is a break, and the second part maps further along the genome. For an insertion, with paired-end mapping you expect the insert size to be shorter; with split reads you expect one part of the read that maps, one part in the middle that does not map (the inserted sequence), and the last part that maps next to the first one. You can see with this kind of signature that the longer your reads are, the better the method will detect these events, and it also gives you more resolution to detect larger insertions. For an inversion, the paired-end signature is pairs with a large insert size where one read is inverted; with split reads, you will see a cluster of reads whose first part maps at one end and whose other part is inverted and maps at the other position, and the same thing at the second breakpoint.

As you can see in this presentation, for split read I always show split read versus paired-end mapping, because the way the new tools analyze the data is to mix both signals; that's how the new tools are working. So it's important not to use split read alone, but split read and paired-end mapping at the same time. Split read on its own also gives you a lot of false positives, although not the same ones as read pair, and it's really slow. So a lot of tools go first with read pair and then add split read on top as an additional layer to confirm the call. The advantage is that you get better detection and breakpoint detection at the same time. These are the split-read tools, and you see, we again have DELLY, which is a tool that does split read and paired-end mapping; that's why we will use this tool. Another really good tool is LUMPY, which also combines both techniques.

Summary of the split-read strengths and weaknesses. Strengths: it works really well paired with the read-pair methods; it gives you base-pair resolution; and it can detect really short insertions and deletions that you don't have the resolution to detect with paired-end mapping, because with read pair they would be below the standard deviation of the insert size. Weaknesses: it needs more coverage than read pair, and there are a lot of false positives.
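As a concrete illustration of the split-read signal, here is a small sketch that collects reads carrying the SAM 'SA' tag, which aligners such as BWA-MEM use to record the supplementary (split) parts of an alignment; the function and its thresholds are made up for illustration and assume a sorted, indexed BAM read with pysam.

```python
import pysam  # assumed available

def collect_split_reads(bam_path, chrom, start, end, min_mapq=20):
    """List reads whose alignment was split into a primary part plus
    one or more supplementary parts (recorded in the 'SA' tag)."""
    bam = pysam.AlignmentFile(bam_path, "rb")
    splits = []
    for read in bam.fetch(chrom, start, end):
        if read.is_secondary or read.is_supplementary:
            continue                                   # keep only the primary record
        if read.mapping_quality < min_mapq or not read.has_tag("SA"):
            continue
        # The SA tag lists the other pieces as "rname,pos,strand,CIGAR,mapQ,NM;" entries.
        for part in read.get_tag("SA").rstrip(";").split(";"):
            rname, pos, strand, cigar, mapq, nm = part.split(",")
            splits.append((read.query_name, read.reference_name,
                           read.reference_start, rname, int(pos), strand))
    return splits
```

Each tuple links where the main part of the read maps to where the rest of it maps, which is the raw breakpoint evidence that split-read callers cluster.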
The third method is read depth. Read depth is mainly used to detect deletions and duplications; it's the traditional method to detect copy number variation. It is based on the assumption that you have a homogeneous distribution of reads along the genome: if you have a copy number variant, a deletion gives a decrease in the amount of reads you see at that position, and a duplication gives an increase. It works really well, except that this assumption is not always true.

How it works: you divide your genome into bins of equal size. You can choose the bin size, and this choice will have an impact on the resolution, but also on the noise and the false positives. Then you estimate the depth of coverage in each bin, and you look for clusters of consecutive bins that show either a significant excess or a significant loss of coverage. This is where NGS meets the methods of CGH arrays, which are also a measure of abundance: you do exactly the same thing, you segment your genome and you look at the variation of coverage. So you have your genome: if you have a deletion, you expect to see almost no reads; if you have a duplication, you see an excess of reads. In terms of what you see in IGV, when you have such an event in a normal sample, the signal is really clear: you clearly see your overall mean coverage and then this excess of reads. So it works really, really well. What you need to do is correct for GC content, because sequencing has a GC bias, and use a segmentation tool, as we do with CGH arrays.

Just to give you an example, this is a cancer sample; we used it on cancer with a tool that I developed, which is called SCoNEs. You have your cancer genome and your normal genome, you just compute the ratio of the two and you plot it. And you see, this is really a case where it works well, because in cancer it is really often not working well; here the patterns are really evident to call. But as you can see, when we look at the data, what we tend to do is: I've got my genome, I've got my points, and I try to find the points that are different from the rest of the genome. And that is the major weakness of this method: it's how you normalize your data. Most methods make the assumption, as I said, that you have a homogeneous coverage. So this is what you expect as a signal of copy number variation: a genome with a constant, homogeneous coverage, and then a region with an excess of coverage for an amplification. Now imagine you have this case: it is really an amplification, but instead of having this homogeneous coverage, you have a noisy, uneven coverage. Finding that this is an amplification becomes really, really more tricky, because we no longer have the assumption that the coverage is almost homogeneous all over the genome. There are several ways to avoid facing this. One is to take larger bins: if you take larger bins, you smooth the coverage you measure, but you lose some resolution.
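Here is a minimal sketch of the bin-count logic described above, assuming pysam and numpy and a sorted, indexed BAM; a real read-depth caller would add GC and mappability correction and a proper segmentation algorithm, which are only hinted at in the comments.

```python
import numpy as np
import pysam  # assumed available

def bin_coverage(bam_path, chrom, bin_size=1000):
    """Count reads per fixed-size bin along one chromosome (the read-depth signal)."""
    bam = pysam.AlignmentFile(bam_path, "rb")
    chrom_len = bam.get_reference_length(chrom)
    counts = [bam.count(chrom, start, min(start + bin_size, chrom_len))
              for start in range(0, chrom_len, bin_size)]
    return np.array(counts)

def flag_cnv_bins(counts, z_cut=3.0):
    """Very naive CNV flagging: bins that deviate strongly from the chromosome-wide
    behaviour. A real tool would first correct each bin for GC content and
    mappability, then run a segmentation method to join consecutive bins."""
    ratio = counts / np.median(counts[counts > 0])   # ~1 expected for a diploid region
    z = (counts - counts.mean()) / counts.std()
    gains = np.where(z > z_cut)[0]
    losses = np.where(z < -z_cut)[0]
    return ratio, gains, losses
```

The bin_size argument is exactly the resolution-versus-noise trade-off mentioned above: larger bins smooth the signal but blur the breakpoints.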
And there is another method that was developed in the lab at C3G, which works fantastically to avoid that, on one condition: to have other samples. The idea of this method, which is called PopSV, is that instead of normalizing your counts over the neighbouring bins, you take a population of samples and you normalize vertically. So you look at this particular bin across a set of samples, say 10, 30, or 200 samples: what is the variation of coverage for this bin in all the samples? You get the local variation and its distribution in your set of control samples, and in red you have the value in your sample. That way you are able to take into account the local, non-homogeneous distribution of that bin. If you have enough samples, I would say that this method beats my method for sure — if you have enough samples to work with, yes.

What's the minimum number of samples you think you need for it to be valid? That's a tricky question. It's a PhD student who developed this method in the lab, and I keep telling him that we need to assess the minimal number. We don't have a minimal number, but from experience, from about 10 to 15 samples we start to have really good results, especially in low-mappability regions and in repeated regions.

You said that if you have the right bin size it works better, but you might miss some information; if you do some kind of automatic estimation of the bin size, does it improve the results? Yeah, it could. But the thing is that there is really local variation: if you look at the coverage, for example in the image of the amplification, you see it change clearly every ten bases or so, so the tool will probably have trouble finding the right size, because some local regions will have a lot of variation and some others not. But you could have a local estimation of the bin size, right? Yeah, probably, with the right tools.

About ploidy: usually hyperploidy is something you deal with in cancer, or when you talk about polyploid organisms. In cancer, usually what we do is estimate the ploidy before running the copy number analysis. We use a tool like Sequenza or Battenberg: we look at the allele frequencies of the regions, we estimate the cellularity and the ploidy of the sample, and then we use that as a parameter in the copy number analysis.

Is it the exact same sequencing machine, with exactly the same parameters for the sequencing, so that any variation comes from the sample? Yeah, you need to have homogeneous controls, as usual. Otherwise it will work for a lot of bins, but some bins will be crazy.

So the different tools: many, many tools, though not so many that are dedicated to cancer. Most of them do a good job, at least in a normal population; it's when you go to cancer and repetitive regions that it starts to be difficult.

Summary of the read-depth approach. It has a relatively low resolution, and the breakpoints are ambiguous, because your breakpoint will fall somewhere inside a bin: depending on the bin size, if you have one kb bins, you only know that your breakpoint is somewhere in a one kb region, so it's not easy to detect the exact breakpoint. And you cannot detect balanced rearrangements. The strengths: it's fast and simple, and it's really easy to interpret; as I showed you on the graph, you can really say, OK, I've got a signal there and there. There are also some methods that do machine learning on normal samples to do that; they work well in normals, but they don't work in cancer, because the problem with cancer is the sample itself: the purity, the cellularity, and everything.
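Coming back to the vertical, population-based normalization idea behind PopSV: here is a toy version of the concept in numpy. PopSV itself is a separate, more sophisticated implementation, so treat this only as a sketch of the idea, with made-up function and variable names.

```python
import numpy as np

def vertical_zscores(test_counts, control_matrix):
    """Score each bin of a test sample against the SAME bin in a panel of controls.

    test_counts    : 1-D array, read count per bin in the sample of interest
    control_matrix : 2-D array (n_controls x n_bins), same bins in the control samples
    """
    # Scale every sample to the same overall depth first.
    ctrl = control_matrix / control_matrix.sum(axis=1, keepdims=True)
    test = test_counts / test_counts.sum()
    # Per-bin reference distribution taken vertically, across samples.
    mu = ctrl.mean(axis=0)
    sd = ctrl.std(axis=0) + 1e-12        # avoid division by zero in empty bins
    return (test - mu) / sd              # large |z| = candidate gain or loss
```

Because each bin is compared only to itself in other samples, a bin that is always noisy or always poorly covered no longer looks like an event, which is why this kind of approach behaves better in low-mappability and repeated regions.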
The strength of read depth is that it determines the copy number: it can really tell you how many copies you have. And with the new methods, you can also get information in repeated regions and low-coverage regions.

The last approach you can use for SV, structural variant, detection is assembly. The idea of assembly is: you have your genome, you don't care about the reference, you take your reads and you reassemble them; you will see this afternoon how assembly works. The idea is to create a new reference specific to your sample. Then you take these long contigs and you map them to the reference, and you see: oh, I've got one contig that is in one piece in my sample but that breaks into a few pieces on the reference, so I know that this region is deleted. The idea is really simple and efficient, because it gives you really long sequences, large fragments that are easier to map. And there are two different approaches: one is to do a whole-genome assembly and the other is to do a local assembly. Whole-genome assembly works better, but in terms of computing it is really intensive to do a whole-genome assembly for humans. Just to tell you, for example, at BC they detect structural variants using whole-genome assembly, but they have a cluster that is dedicated only to that: 3,000 CPUs dedicated only to whole-genome assembly of human samples. So you need to have a lot of resources to do that, and that is why the local assembly approach has been developed.

How does the whole-genome approach work? As I said, you do your whole-genome sequence assembly and then you compare: you run BLAT, you group your scaffolds by chromosome, and you look at how they align. It works really well because it really brings you everything that is novel.

How does local assembly work? The idea is that you do your mapping, and if you have, for example, an insertion here, the reads from fragments that come from inside your insertion won't map: that is what we call the orphans, the pairs where both reads are unmapped. And then you have the reads here in red and green, which are called OEA, which means one-end anchored: pairs where one read maps correctly and the other read doesn't map. You take these two sets of reads, orphans and OEA, and you do the reassembly on this subset of reads only, so you really reduce what you need to assemble, and you really reduce the compute you need to do it. Then you have your contigs, and what do you do? You take your contigs, you take the OEA reads that should be at the ends of your contig, and you look where their anchored mates map in the genome; and that tells you where the new sequence, the new SV, is located in your reference. So it's a kind of signature you will see, and it works well. It's also intensive in terms of compute, but it works. And there is a new tool that does this local assembly really well; I didn't update my slides because I only found it one week ago, but I will look in my notes. Here are typical de novo assemblers you can use to do the assembly (the tool I mentioned is not in the list): Cortex, SGA, DISCOVAR, ABySS, Ray, and so on.
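Here is a sketch of how the orphan and OEA read sets described above could be pulled out of a BAM file with pysam before being handed to a local assembler; the function name and filtering are illustrative assumptions, not the pipeline of any specific tool.

```python
import pysam  # assumed available

def collect_assembly_reads(bam_path):
    """Split reads into 'orphans' (both mates unmapped) and 'OEA' pairs
    (one end anchored, the mate unmapped), the input of a local assembly."""
    bam = pysam.AlignmentFile(bam_path, "rb")
    orphans, oea_anchored, oea_unmapped = [], [], []
    for read in bam.fetch(until_eof=True):        # until_eof also returns unmapped reads
        if read.is_secondary or read.is_supplementary:
            continue
        if read.is_unmapped and read.mate_is_unmapped:
            orphans.append(read)                  # contribute sequence only
        elif read.is_unmapped and not read.mate_is_unmapped:
            oea_unmapped.append(read)             # sequence to assemble into the contig
        elif not read.is_unmapped and read.mate_is_unmapped:
            oea_anchored.append(read)             # anchors the contig on the reference
    return orphans, oea_anchored, oea_unmapped
```

The anchored OEA mates are what later place the assembled contig back on the reference, as described above.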
So what are the strengths and weaknesses of this method? Weaknesses: it is computationally very intensive, especially for whole-genome assembly, and sometimes it can be hard to resolve the result of your alignment: if you are in a repeated region, your contigs could map to several different regions, so you don't know exactly which region it is. Strengths: you can get the best resolution of the breakpoint, and you can detect every type of SV.

Just to give you a summary of the methods, from read depth to read pair, split read, and assembly, that is the order of resolution you can get: split read and assembly have the higher resolution, but also the higher difficulty and cost. So choosing is always a balance. A summary of the SV methods: four methods exist, each with its own strengths and weaknesses. If you are interested in one specific type of structural variant, you can adapt and choose the method that fits your data. The most recent tools now combine several methods; LUMPY and DELLY, I really recommend you use these tools. And the major challenge when we do SV detection is to understand complex events, because you will have several sets of calls in the same region, and really understanding what you have is sometimes really, really hard; you cannot always resolve it. And also to really find the breakpoint, because if you want to validate your data, you need the real breakpoint to design some kind of PCR or that kind of thing.

Now, in terms of what you can expect to see in IGV: this is the typical deletion; we saw one yesterday. This is what we see for a duplication: it's a tandem duplication, and as we saw, the read pairs are in an abnormal orientation, with a large insert size and an excess of coverage. This is what we see for an inversion: you have two clusters of reads that cluster together, this one with this one, and these two others. And this is your insertion: you've got some of the reads that map there, and all these mates that belong to the new sequence and don't map. And the typical view that we use for structural variants is the Circos plot: usually you have different tracks that show the positions of your events, and you've got the translocations across your genome. If you work in cancer, you have probably seen this kind of plot several times. Circos is a really good tool to represent your SVs; the original Circos tool gives you really high-quality graphs, but it's a bit complicated to use. There are now other solutions, packages that generate this kind of Circos-style graph, which are much easier to use and much more oriented towards genomics. That's it. Do you have any questions?

Most of these tools can work with targeted data, except the read-depth tools. Well, not PopSV — PopSV will work, because the problem with targeted data and read depth is that your coverage in targeted data, in the exome for example, is linked to how many baits, how many probes you have. The way it works is: you've got your exon like that, and to do the targeted capture you design probes that catch this region. Some exons will have a lot of probes, some others really few. So all the methods that normalize horizontally within the same sample will be biased by the number of probes you have for each exon to catch the sequence, so it won't work. Some new tools have been developed to try to take that into account, but they are not working so well. When you have exome data, it's better to use a population approach.
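To illustrate why normalizing vertically helps for capture data, here is a toy per-target log-ratio computation against a panel of controls captured with the same probe design, assuming numpy; it is only a sketch of the idea, not an existing exome CNV tool.

```python
import numpy as np

def exome_log_ratios(sample_counts, panel_counts):
    """Per-target copy-number signal for capture (exome) data.

    sample_counts : 1-D array, reads per capture target in the test sample
    panel_counts  : 2-D array (n_controls x n_targets), same targets in the controls
    Comparing each target only to the SAME target in other samples cancels the
    probe-density bias that breaks within-sample (horizontal) normalization."""
    sample = sample_counts / sample_counts.sum()
    panel = panel_counts / panel_counts.sum(axis=1, keepdims=True)
    baseline = np.median(panel, axis=0) + 1e-12
    return np.log2((sample + 1e-12) / baseline)   # ~0 diploid, >0 gain, <0 loss
```

A target with few probes has a low count in every sample, so its ratio stays near zero unless the copy number really changes.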
Like PopSV, or there is another tool — I'm trying to remember the name of the tool. I think it's better. Which one is it? No, it's not there. There is another tool, I don't remember the name, that also uses a kind of population approach. For exomes it's better to normalize your read depth vertically, because for a given exon every sample has the same number of probes.

For split read? You take your reads, and you look at the reads for which one mate maps normally and the other has been cut into pieces. You will have one part which is called the primary alignment, and the others will be supplementary alignments. Then you look at how the reads that are cut into two, three, four pieces map: the longest part will be the primary alignment and the others will be supplementary, that's just the way they are named, and you look where the supplementary alignments are positioned relative to the primary one. If they are positioned at some other, distant location, you probably have a translocation; if they are positioned at a distance of x number of bases, you have probably lost the sequence between the primary and the supplementary alignments.

So why do you need to work with the two, why do you combine the approaches? It doesn't matter whether the read is cut into one, two, or three pieces, and as I said, you don't call from one read: you try to look for clusters of reads that show the same pattern at the same location, clusters with the same parameters. The primary alignments can sit on either side of the two breakpoints of your event, because the reads arrive at a breakpoint from both directions, so the primary could be this one or that one. You just look at the link between the primary and the supplementary reads, and at the clusters of reads that show the same jump. OK.
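Finally, to illustrate the last point about clustering reads that show the same "jump", here is a toy grouping of the split-read records produced by the earlier sketch; the window size and support threshold are arbitrary illustrative choices.

```python
from collections import defaultdict

def cluster_breakpoint_evidence(splits, window=100, min_support=3):
    """Group split-read records that support the same candidate breakpoint pair.

    `splits` holds tuples like the ones produced by collect_split_reads above:
    (read_name, primary_chrom, primary_pos, supp_chrom, supp_pos, supp_strand).
    Reads whose primary and supplementary parts land in the same windows are
    counted as evidence for the same junction."""
    clusters = defaultdict(list)
    for name, p_chrom, p_pos, s_chrom, s_pos, s_strand in splits:
        key = (p_chrom, p_pos // window, s_chrom, s_pos // window, s_strand)
        clusters[key].append(name)
    # Keep only junctions supported by several independent reads.
    return {k: v for k, v in clusters.items() if len(v) >= min_support}
```

Only clusters with enough independent reads are reported, which is the same filtering logic the callers use to control the false-positive rate.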