Module 5: structural variant calling. This is where calling variants starts to become harder. We saw that it was quite easy for SNVs; for indels it was a little bit more tricky, but for SNVs it was not so bad. In this structural variant module, I will first try to explain what a structural variant is, because not everybody knows, then how we can find them — there are many ways to do it — what the strengths and weaknesses of each strategy are, and then we will go quickly through what the signals of structural variants look like in your data, and we will visually explore the SVs.

So what are structural variants? They are genomic rearrangements larger than what we call indels, so mainly everything that is more than 50 base pairs. This threshold is something that has been chosen by convention, not something fixed — some people say indels are structural variants, other people say no. This includes deletions, insertions, inversions, mobile element events (transpositions), duplications, and translocations. When people started to look at structural variants, NGS did not exist. The way they did it was either karyotyping, to look at events like this one where you have a translocation, or FISH and this kind of technique. So you can imagine that in those cases the resolution and accuracy were not really good.

Just an example in cancer: structural variants are a really important feature of the genome, because in cancer cells you do not have the usual selective pressure, so the genome can sometimes become quite crazy. That is why you sometimes see chromothripsis and this kind of feature, where the genome breaks into many parts and rearranges. This is an example of a cancer karyotype: you see that a lot of chromosomes have copy number changes like this one, and a lot of others are a mix of two different chromosomes. That is why, when we started to study the genetics of cancer, this kind of study needed a higher resolution than those techniques could provide.

The different classes of SV are the copy number variants (large deletions and duplications), the copy-neutral rearrangements (inversions, translocations), and the other types of structural variants (novel insertions or transpositions of mobile elements). Just to give you an idea, when we talk about structural variants we always talk about what happens in the sample when we compare it to the reference, because if you look only at the genome of the sample, it is a deletion only if you compare it to something — you always need the reference. So this is a deletion, where you lose a part; an insertion, where you gain a section; insertion of a mobile element, so a transposition; tandem duplication; interspersed duplication; inversion, where this part has been inverted; and translocation. Is everybody okay with that? Because if you do not get that, you will be lost for the rest of the presentation. As I told you, the techniques to detect structural variants evolved: at the beginning, some 60 years ago, we were doing karyotyping, then FISH, then microarrays, especially for copy number variation, and now we are at high-throughput sequencing and high resolution.
Just to give you an idea of how the techniques evolved: when we started with CGH arrays, we were just able to detect variation in copy number, but not these more complex signals. As the technology evolves, it usually gives you more information, more signals, so you become able to detect different or more complex types of events, but the calling becomes more complicated. CGH arrays were really easy; SV calling using NGS is much more complicated.

Just to give you an idea of how we detect variants in general: for a point mutation, with NGS you look at reads carrying the variant; for indels, you look at small missing parts within the reads; for a large deletion, you look at regions where you see fewer or no reads; for a duplication, regions where reads accumulate; and for structural variants — which is what makes them complicated — you need reads that give you the specific pattern of that specific structural variant. Here I just give the example of a translocation, but each type of structural variant will have a specific signature to be identified in your mapping.

So there are four strategies we can use to detect structural variants: the read pair, the read depth, the split read, and the assembly approach. I will go into detail for each method, so don't worry.

The read pair: the idea is to identify structural variant breakpoints by examining the alignment of your read pairs. When you do your sequencing, you know the size selection you made on your library, and you know that if a fragment spans the breakpoint of a structural variant, that will affect the insert size or the orientation of your reads. So the read pair approach is really focused on examining clusters of reads that show an abnormal insert size or an abnormal orientation to detect structural variants. This technique is limited by the fragment size: the larger your fragment size, the larger your sensitivity. It is really based on the insert size and the orientation.

What do these methods do? They estimate the distribution of your insert sizes, so the mean, and then, based on the distribution, the standard deviation of the insert size. Then you choose which level of sensitivity you want for your calling: you choose how many standard deviations away from the mean you still consider a pair to be concordant, say 2 SD or 3 SD, and everything outside of that is called discordant. Then you flag the reads that are discordant and you cluster them, because a pair could be discordant by chance — in that case you will find discordant pairs spread everywhere — or discordant because of biology, where you see many discordant reads accumulating at the same location. To give you an idea: with concordant reads, you have your reference genome, you have your test genome here, and you expect to see your reads in the correct orientation at the correct distance.
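To make the discordant-pair idea concrete, here is a minimal sketch of how the insert-size statistics and discordant pairs could be flagged. This is my own illustration, not the code of any actual caller; the pysam usage, the indexed-BAM assumption, the default 3-SD cutoff and the function names are all assumptions for the example.

```python
import statistics
import pysam  # assumed available; any BAM-reading library would do


def insert_size_stats(bam_path, sample_size=100_000):
    """Estimate mean and SD of the insert size from properly paired reads (indexed BAM assumed)."""
    sizes = []
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch():
            if read.is_proper_pair and read.is_read1 and read.template_length > 0:
                sizes.append(read.template_length)
            if len(sizes) >= sample_size:
                break
    return statistics.mean(sizes), statistics.stdev(sizes)


def discordant_pairs(bam_path, n_sd=3):
    """Yield reads whose insert size or orientation is abnormal (candidate SV support)."""
    mean, sd = insert_size_stats(bam_path)
    lo, hi = mean - n_sd * sd, mean + n_sd * sd
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch():
            if read.is_unmapped or read.mate_is_unmapped or not read.is_read1:
                continue
            same_strand = read.is_reverse == read.mate_is_reverse  # abnormal orientation
            size = abs(read.template_length)
            if same_strand or size < lo or size > hi:
                yield read  # downstream, these reads would be clustered by position
```

In a real caller, the clustering step is what separates the pairs that are discordant by chance from the ones supporting a genuine event.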
If you have a deletion and your pair spans it, you expect your insert size to be larger, because the reference genome contains the piece that is missing in your test genome. What you have to see is that the size of the fragment is correct in your test genome; it is only when you place it back on the reference genome that it becomes discordant, because something is different between your test genome and your reference genome. In the same way, when a pair spans an insertion, you expect to see a smaller insert size, because there is a fragment in your test genome that is not in the reference, so the reads collapse closer together. Here you can also see a limitation of your data: if the insertion is too large, you will not have any pair that spans it, and you will miss it. So having longer reads, or mate pairs, can be an advantage.

With a tandem duplication, you will see some pairs that map on the reference genome with a large insert size and possibly an abnormal orientation, because those reads catch the junction between the two copies. With an inversion, you will have, at one breakpoint, a large insert size with one read inverted, and at the other breakpoint, a large insert size with the other read inverted. If you have something more complex, like an interspersed insertion coming from another region, you will have pairs that span from the insertion site to that other region, and from the other end of the inserted region back to the site. If the inserted segment is too large, it is sometimes hard to make the difference between a large interspersed insertion and a translocation, because if you take one pair alone, the signal could be the same as for a translocation. But the other pairs give you the signal that tells you: okay, this is not a translocation, this is an insertion from another region. So you start to see that the more complex it becomes, the more difficult the calling is, and the harder it is to make sense of what you see in your alignment.

This is a really non-exhaustive list of read pair callers, and today we will use DELLY to do the calling. You will understand why we use DELLY: it has some main advantages, and I will tell you about them later. Just to tell you, when you do read pair calling only, mostly you will detect deletions, because they are the major part of the events, and callers have real trouble defining the other types of variants. The main issue with this method is when you face complex regions, like this one: when you get pairs with different sizes, with deletions, understanding what happened is not easy. You are able to flag that there is an event there, probably multiple events, but you are not able to make sense of it.

So what are the strengths and limits of this method? The limits: it is difficult to interpret in repetitive regions, because the mapping will not be good there, and if you have bad mapping, you have bad signal. It is difficult to characterize complex regions, and you have a high rate of false positives, because when you choose the threshold, you can choose it too tight or too loose. One of the strengths is that almost all types of variants can be detected. The second method: the split read.
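As a rough illustration of how the pair signatures described above map to candidate SV types, here is a hedged sketch that builds on the lo/hi insert-size bounds from the previous example. It assumes a standard forward-reverse short-insert library; tandem duplications, which produce outward-facing ("everted") pairs, are deliberately left out, and real callers only decide after clustering many pairs supporting the same event.

```python
def classify_pair(read, lo, hi):
    """Rough SV-type hint from a single discordant pair (FR short-insert library assumed)."""
    if read.reference_id != read.next_reference_id:
        return "translocation_or_interspersed_insertion"  # mates on different chromosomes
    if read.is_reverse == read.mate_is_reverse:
        return "inversion_candidate"                      # both mates on the same strand
    size = abs(read.template_length)
    if size > hi:
        return "deletion_candidate"                       # insert size larger than expected
    if size < lo:
        return "insertion_candidate"                      # insert size smaller than expected
    return "concordant"                                   # everted duplication pairs not handled here
```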
So, the idea of the split read is to identify reads that contain the breakpoint. With the read pair we were trying to find fragments that contain the breakpoint; here, the split read tries to find reads, the smaller part, that contain the breakpoint. The idea is that when you have an event, you will have reads that map correctly, because they map entirely within one of the regions, and reads that have trouble mapping, because they sit directly over the breakpoint. The goal is to catch these reads that have trouble mapping, understand why, and split them. As for the read pair method, having longer reads is better, because you have more chance to span the breakpoint, and also because in a complex region you could have multiple breakpoints on the same read. That is why a lot of structural variant work is now done with long-read technology or with synthetic long reads like 10x: since the reads coming from the same original fragment carry the same barcode, we are able to understand the complex regions a little bit better.

How does it work? You have your reads, you align them. Most of your reads will align perfectly on the genome, and for some pairs one of the two reads will have trouble mapping: either only part of the read maps, or the read does not align at all. Usually we focus on reads that have been split, broken, so that only one part maps to a region, while the other part maps elsewhere or is unmapped. The idea is to detect this kind of signal.

If we are talking about a deletion, you look at your reads: one part is correctly mapped — with read pair mapping you would have seen a large insert size here — and you look at the other part of the read. The method breaks the read and realigns that part in the proximity of the first one, trying to find another region where it could map, and so finds where the deletion is. For an insertion it works the same way: you expect your read to split into parts, with the parts flanking the insertion stacking together on the reference. You can clearly see the limitation of this approach: if the insertion is too long, you will not be able to place the other part of the read at all. That is why, for structural variants, we now tend to use a lot of long reads, mate pairs, or synthetic reads to get this information.

Now, for an inversion — as you remember, for the read pair approach the signature was an inversion of the read orientation and a high insert size — here you expect to see a read that is split, with the end of the read mapping to another place in inverted orientation. So these are the kinds of signatures you can exploit, and they are different from the read pair ones. As you can see in all these figures, I always show the split read together with the read pair, because if you use split mapping alone it gives a lot of false positives — a read can fail to map for a lot of reasons. So nowadays split mapping alone is not really used; it is used in combination with the read pair, and the norm now is to use a combination of both. The advantage is that with the read pair you usually do not get the exact breakpoint, whereas the split read can give you the real breakpoint at base-pair resolution.
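To illustrate what "catching the reads that have trouble mapping" can look like in practice, here is a minimal sketch that collects split or heavily soft-clipped reads from a BAM. It is an illustration only, not how DELLY or LUMPY are implemented; the SA-tag convention is the one modern aligners such as BWA-MEM use for split alignments, and the 20 bp clipping cutoff is an arbitrary assumption.

```python
import pysam  # assumed available


def split_read_candidates(bam_path, min_clip=20):
    """Yield reads that are split or heavily soft-clipped, i.e. likely to span a breakpoint."""
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch():
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            if read.has_tag("SA"):                 # supplementary alignment: the aligner already split it
                yield read, read.get_tag("SA")
                continue
            cig = read.cigartuples or []
            # op 4 = soft clip in pysam's CIGAR encoding; a long clipped tail hints at a breakpoint
            clipped = sum(length for op, length in cig if op == 4)
            if clipped >= min_clip:
                yield read, None                   # the clipped part would be re-aligned locally
```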
It is also better than the read pair for small variants: with read pairs, the insert size needs to deviate by enough standard deviations to detect the deletion, whereas when a read breaks into two pieces over a deletion of four or five bases, you are able to catch it. The split-read tools are these ones, and you can see that some of them are also tools that do read pair calling. The two main tools for structural variants that I recommend and that we use are DELLY and LUMPY, which are two tools that use both approaches. DELLY also provides additional features, which is why it is the one we will use. Its author is Tobias Rausch in Germany, and he is a really nice guy: you write to him and usually within the next two days you have an answer to your question. I teach with him at another workshop.

The weakness of the method is that you need sufficient coverage to have reads that span your breakpoint, and you can have false positives in regions that are hard to map. The strengths: it works well in addition to the read pair, it provides base-pair resolution of the breakpoint, and it can detect very short events.

The next method is the read depth. The read depth is based on a simple rationale: when you do your sequencing, you expect a homogeneous representation of your fragments all along the genome. The read depth approach is dedicated to copy number variation. If your reads are homogeneous along the genome, then for a deletion you will have fewer reads on the region, and for a duplication you will have more reads, because you have more DNA fragments from this region. The way it works: you take your genome, you divide it into bins of equal size, you estimate the depth of coverage in each bin, and you look for clusters of consecutive bins that show a significant excess or loss of depth of coverage. It is a really simple approach — basically what was developed for CGH arrays 10 or 20 years ago.

Here is an example of what you can see: if you look at the overall coverage, you see your general coverage level and you see the amplification as an accumulation of reads. This is another example, in cancer: this is one of the tools I developed, called Scones, which is dedicated only to cancer samples. You have the depth of coverage represented in your tumour, in your normal, and the log ratio, and you see that it works really well to detect this kind of event — large chromosomal variation.

But there is a problem with this method. If we go back here, you can see that the coverage is not homogeneous, so the main assumption of the method is wrong — or rather, it is true for most parts of the genome, but not for all of it. For germline copy number, like for neurodegenerative disease, where you have good-quality events, it works well. When you start to work in cancer, the variation is usually slighter, because of tumour purity and heterogeneity. So if the variation you are looking for is smaller, making the difference between the technical variation and the biological variation can be tricky.
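Here is a minimal sketch of the bin-and-count idea just described: divide one chromosome into fixed-size bins, count the reads per bin, and report runs of consecutive bins with abnormal coverage. It is an illustration under simplifying assumptions (a naive mean/SD threshold, no GC or mappability correction), not the actual algorithm of any published tool; bin size, thresholds and function names are mine.

```python
import statistics
import pysam  # assumed available


def bin_counts(bam_path, chrom, chrom_len, bin_size=1_000):
    """Count reads starting in each fixed-size bin along one chromosome."""
    counts = [0] * (chrom_len // bin_size + 1)
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(chrom):
            if not read.is_unmapped and not read.is_duplicate:
                counts[read.reference_start // bin_size] += 1
    return counts


def cnv_segments(counts, n_sd=3, min_bins=3):
    """Return runs of >= min_bins consecutive bins with abnormally high or low coverage.
    A robust estimator (median/MAD) would be preferable in practice."""
    mean, sd = statistics.mean(counts), statistics.stdev(counts)
    state = ["gain" if c > mean + n_sd * sd else "loss" if c < mean - n_sd * sd else "normal"
             for c in counts]
    segments, start, kind = [], None, "normal"
    for i, s in enumerate(state + ["normal"]):      # sentinel to close the last run
        if start is None and s != "normal":
            start, kind = i, s
        elif start is not None and (s == "normal" or s != kind):
            if i - start >= min_bins:
                segments.append((start, i, kind))   # bin indices; multiply by bin_size for bp
            start = None if s == "normal" else i
            kind = s
    return segments
```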
What the method thinks the coverage looks like is this: bins with equal depth of coverage, and when there is an event, a clear change in that depth of coverage. The reality is more like this: you have bins with natural variation, because some sequences, some genomic regions, are more complicated to capture, so you get all this natural variation, and then you try to find the aberrant bins among that. As you can see, it can be quite tricky. In a perfect world, when you have enough coverage, this variation starts to shrink and you capture even the complicated regions, and it is fine. But in regions that are hard to map, if you do not have enough coverage, it can be really tricky.

There is a tool that was developed by a PhD student, and I have to say, I have my own tool for cancer, but this tool beats all of mine, on one condition: if you have enough samples, more than about 20, go with this tool — cancer or not, it beats the rest. It is hard to give a precise number, because we have not had time yet to do the benchmarking to see the limit of detection, but we say 20 to be safe, and it has already been run with 12 or 13 samples. The idea of the tool is this: imagine you have this variation — these are genomic windows in one sample. All the methods we discussed do the normalization horizontally, between the bins. Now imagine you stack this graph for every sample. Then, for each bin, instead of normalizing horizontally, you normalize vertically: you take the same bin in every sample, normalize your data across all samples for this bin, and you see what natural variation you can expect for this bin. Once you have the natural variation in your cohort, you compare your sample to it: either your sample is within the natural variation, or it is outside of it, and then you have a copy number change. It was published by Jean Monlong, and it is available here.

Yes? [Question: what if the copy number variant is present in a significant proportion of the samples, say 30% — doesn't the normalization absorb it?] At 30% you will probably still call it correctly. But in the opposite case, if, say, 60% of your samples carry it, you will probably call the ones that do not have the variation instead. Just to give you an idea, we use the same approach to classify single cells: we have a set of single cells and we want to know which ones are real cancer cells and which ones are contamination from normal tissue, and we do it by copy number. We are able to separate the two groups like this with this method, and then we need to dig to find which group is the cancer and which is the normal, but it is really efficient at making the difference. So unless you are at something like 95%, I would say it is possible.

As for the tools you can use, there are a lot of them. This one is mine, this is PopSV, these are older ones. This one also tries to use population information, but it is not as good as PopSV.
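To make the vertical, per-bin normalization concrete, here is a minimal sketch of the intuition. This is not PopSV's actual code — the real tool uses proper coverage normalization and more robust statistics — and the dictionary layout, the z-score formulation and the idea of calling bins with large |z| are my own illustrative assumptions.

```python
import statistics


def per_bin_zscores(coverage_by_sample):
    """coverage_by_sample: dict sample -> list of per-bin coverages (same bins for everyone).
    For each bin, compare each sample to the distribution across the cohort
    (vertical normalization); a cohort of at least a few samples is assumed."""
    samples = list(coverage_by_sample)
    n_bins = len(next(iter(coverage_by_sample.values())))
    z = {s: [0.0] * n_bins for s in samples}
    for b in range(n_bins):
        values = [coverage_by_sample[s][b] for s in samples]
        mean, sd = statistics.mean(values), statistics.stdev(values)
        for s in samples:
            z[s][b] = 0.0 if sd == 0 else (coverage_by_sample[s][b] - mean) / sd
    return z  # consecutive bins with large |z| in one sample are copy-number candidates for that sample
```

The point of the vertical step is that a bin which is naturally noisy will be noisy in every sample, so it no longer looks like an event in any single one of them.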
And you see that many tools are for germline copy number; there are not so many tools dedicated to somatic events, because it is more complicated, so most tools were created for the germline case. [Question from the audience about duplications.] You cannot make the difference: you can say there is a duplication, but you cannot say whether it is a tandem or an interspersed duplication. Exactly — if you combine it with the pair information you can; actually the PopSV author is working on adding this additional information, to get a better typing of the events and also to better define the breakpoints. And yes, it is only for copy number variation.

So what are the strengths and limits of this method? We still have a lower resolution for copy number events, 5 to 10 kb, though we are able to go down to about 3 kb. The higher the resolution you want, the higher the level of noise you will have in your signal, because to get high resolution you need to cut your genome into really, really small bins, and then you face more local sequence-context variation, so the variance of your signal increases with small bins. When we work, we usually use bins of 1 kb and we require 3 to 5 consecutive bins showing the same signal to call a copy number change. The breakpoints are ambiguous: since you work with bins of 1 kb, you cannot locate the breakpoint within the bin. Some tools now try to solve that with sliding-window bins to refine the breakpoint, but it takes much more compute time. And you cannot look at copy-neutral rearrangements, since they do not change the quantity of DNA. The strengths: it is fast, simple, and easy to interpret — when you see the graph with a deletion or an amplification, it is really easy to understand. It gives you a copy number, which many methods do not: they give you the event, but not how many copies you have. And if you use PopSV, it is able to handle data with low coverage and regions with low mappability, at low resolution.

The next method is to do assembly. The rationale is: why should I map my reads to the reference genome, when I know I am looking for events where my sample differs from the reference? I know my mapping will be hard because of those events, and especially because my reads are small, so the event could be larger than my reads or even my fragments. So the rationale behind this method is: forget about mapping at first, just assemble the sequence, and once you have generated your contigs you have long sequences that you can map to the reference genome and see how they differ from it. You will have more chance to resolve complex regions or larger events.

There are two approaches for assembly: the ones that do whole-genome assembly and the ones that do local assembly. Whole-genome assembly is what I just said: forget about mapping, take all your reads, do a de novo assembly of your genome. Once you have your contigs, you look at how they scaffold, how they fit together, and then, a bit like a split-read approach, at how your contigs split when aligned to the reference, and you are able to determine the structural variants. Its main advantage is that it works really well to catch novel insertion and deletion events.
Doing a whole-genome assembly of a human is time-consuming and resource-consuming. I collaborate with a group at the BC centre in Vancouver; they do it — it is the way they call structural variants — but they have one person dedicated only to that, because it is really resource-expensive. [Question: can you just do the initial alignment, since the bulk of the sample aligns flawlessly, use that as a scaffold, and then reassemble only the non-aligning part?] Exactly — that is the other approach: the local assembly.

For local assembly, you map your reads — say you have a novel insertion — and you end up with reads that map to the genome and reads that do not. Normally you would say: I do not care about the reads that did not map. But here they did not map because of the novel insertion, or because their sequence is simply unable to map. So you take what we call the orphans, the pairs where both reads are unmapped, and you also take what we call one-end-anchored reads, where one read is mapped correctly and its mate is unmapped. Then you take this subset of reads — the amount depends on how many structural variants and insertions you have, but it is really only a few percent of your reads — and you do the assembly on it. You are not trying to assemble a huge sequence, so it is much easier, much more feasible, and you obtain your contigs. Then you map the one-end-anchored reads onto your contigs and you expect these reads to sit at the edges of the assembled contig, and you are able to anchor where this contig goes in your genome based on where their mapped mates are. And then you are able to resolve the structure of the region. This is the method that is more used now, and it works really well for insertions and this kind of analysis. Then you just need to BLAST your contig.

[Comment: but in both of these cases, if you have a tandem repeat, you cannot actually assemble it — you still just see a pile-up and have to infer what it must be.] Exactly, unless you have a very long read that reads through the entire tandem repeat. The problem with tandem repeats is also that the reads will usually still map with a normal orientation and insert size, so they never end up in the unmapped set and this method will not catch it, whereas the full whole-genome assembly would have better resolution for that.

So these are the patterns you look for. They are quite similar to the patterns we saw for the split read, but here you can imagine that the segments are kilobases long or more, so it is much easier to detect the pattern. These are the different tools you can use for assembly — at the time I prepared this, I was not aware of a single tool that does all of this on its own; most of the time when I do it, I do the different steps myself and then run the assembly. That is why I list the different assemblers you can use: Cortex, SGA, DISCOVAR, ABySS, Ray, and a lot of others.

So what are the strengths and limitations of the assembly method? It is computationally very intensive — even the local assembly is still intensive.
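As a sketch of the read-selection step for local assembly, here is a minimal example that pulls out orphan pairs and one-end-anchored reads and writes them to a FASTA for a local de novo assembler. It is an illustration only: real pipelines also keep soft-clipped reads, preserve qualities, and track the anchored mates' positions; the pysam calls, file layout and function name are assumptions.

```python
import pysam  # assumed available


def reads_for_local_assembly(bam_path, fasta_out):
    """Collect orphan pairs (both mates unmapped) and one-end-anchored reads
    (one mate unmapped) and write them to a FASTA for local assembly."""
    kept = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam, open(fasta_out, "w") as out:
        for read in bam.fetch(until_eof=True):      # until_eof also returns the unmapped reads
            if read.is_secondary or read.is_supplementary:
                continue
            orphan = read.is_unmapped and read.mate_is_unmapped
            one_end_anchored = read.is_unmapped != read.mate_is_unmapped
            if orphan or one_end_anchored:
                out.write(f">{read.query_name}\n{read.query_sequence}\n")
                kept += 1
    return kept  # typically a few percent of reads; anchored mates later place the contigs
```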
And it struggles to resolve repetitive and complex regions, because the assembly will break on the repeat units. The strengths: you have the best resolution of the breakpoint, because you have the real sequence, and if you do the whole genome, in theory you can see almost all classes of variation, especially the small deletions and small insertions.

Just to give you a summary of the strategies, you have the four approaches: depth of coverage, paired mapping or read pair, split read, and assembly. When the difficulty is low, usually your resolution is low: when you want to look at large events, you take those methods; the more resolution you want, the more work, difficulty, cost, and noise you will face.

So, a kind of summary of SV detection: we have four different methods, and the more recent tools now try to combine different methods, like LUMPY, and soon PopSV for the breakpoints. But whatever method you use, it remains really challenging; there is no perfect method. When we call SVs, we use many tools, a kind of ensemble approach: we use several callers and try to see what the consensus between them is. What is complicated is not so much running the calling, but sometimes understanding what your callers have output, because many, many events will be marked as an unknown type of SV — you have complex regions and it is not easy to make sense of them. The challenge is to find the breakpoint, exactly, and the validation. For a deletion, validation is easy: you put PCR primers at each border of the deletion and you either quantify the region or look at the size of your fragment. But when you want to validate a large insertion or another large event, it is complicated.

Now, in terms of visualization, this is the type of visualization we already saw yesterday with Florence. In IGV, when you have a deletion, you have read pairs that span it with a large insert size, and you generally clearly see a drop of coverage in between. For a duplication, you have the reads with the duplication signature, which means this fragment has probably been added next to the other copy. For an inversion, you have this read that pairs with this one, and this read that pairs with that one — here I should have asked IGV to colour the reads by pair orientation, it would be easier to understand. For the insertion, you have this set of reads that goes from here to another place. And usually, when you do cancer and you try to visualize SVs, we use a Circos plot: you have your reference genome around the circle, you put your coverage, you have tracks with points for the different types of events, and you have the translocations drawn as links that show you where reads jump from one place to another. And that's it.
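Coming back to the ensemble idea in the summary above — running several callers and keeping the consensus — here is a minimal sketch of one common, if simplistic, way to intersect two SV call sets by reciprocal overlap. The 50% threshold and the (chrom, start, end) tuple layout are assumptions for illustration; real comparisons also match the SV type and allow breakpoint slack.

```python
def reciprocal_overlap(a, b):
    """a, b: (chrom, start, end). Return the reciprocal overlap fraction (0 if different chrom)."""
    if a[0] != b[0]:
        return 0.0
    inter = min(a[2], b[2]) - max(a[1], b[1])
    if inter <= 0:
        return 0.0
    return min(inter / (a[2] - a[1]), inter / (b[2] - b[1]))


def consensus_calls(caller1, caller2, min_ro=0.5):
    """Keep calls from caller1 supported by caller2 with >= min_ro reciprocal overlap."""
    return [sv for sv in caller1
            if any(reciprocal_overlap(sv, other) >= min_ro for other in caller2)]
```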