OK, so last module for me, module 5: variant calling again, but this time structural variant calling. What I hope you will learn today is what structural variants are, how we try to discover SVs in NGS data, that is, in next-generation sequencing data, and how difficult it is. There is no perfect method, so there are many approaches, and I hope you will understand the strengths and the weaknesses of each strategy I will present. At the end we will look quite rapidly at the structural variant signal in IGV, in your reads, and in the practical we will call some SVs and explore some SVs in your data.

So what do we call a structural variant? There is no single clear definition, but the accepted one is that structural variants are genomic rearrangements that affect a larger part of your sequence: usually more than 50 base pairs, but the threshold depends on the criteria, sometimes it's 20, for some people it's 100 base pairs, for others it's 1 kb. It includes deletions, insertions, inversions of sequence, mobile element insertions, duplications, and translocations.

Just to give you an idea: many years ago, before the NGS era, calling structural variants was done by looking at chromosome bands, karyotypes, or FISH, so you can imagine how complicated it was. This is an example, from cancer for instance, of the structural variant understanding we had before the NGS era.

So I talk about SVs, but there are different classes of SVs. There are the copy number variants, the deletions and duplications, which we call CNVs; the large deletions and large duplications are usually treated as a separate signal. There are the copy-neutral rearrangements, inversions and translocations. And there are the other types of variants. So there are roughly three classes.

What is important when we talk about structural variants is that we talk about what we observe in the sample compared to what is in the reference. In your sample alone you don't have a variant: if you compare your sample to your sample, it is just its own sequence. When we talk about a deletion, we mean that a sequence from the reference has disappeared in the sample; there is no deletion physically present in your sample. So keep in mind that a structural variant call is always relative to the reference you use, and this affects how you observe the data and what the signal looks like. A deletion, for example, will look like read pairs with a really large insert size; initially people might say, it looks like there is more DNA in there, but no, the insert looks large because sequence has been lost, and you are looking at the sample's signal projected on the reference. You need to translate back from reference to sample to really understand what the SV signal means.

So we have the different events: deletion, insertion, mobile element insertion, tandem duplication, interspersed duplication, inversion, and translocation.

Just to come back to the timeline: at the beginning we were doing karyotyping and FISH; then people started to look at CNVs using CGH arrays or SNP arrays; and now we are in the era of, I should say, DNA sequencing. Just to give you an idea of the impact on CNV calling: when we were working with CGH, we only had information about the amount of DNA per region.
And as the technology advanced, we were able to keep the same information but add additional signals that give us a more clever way to look at the variants. For example, if you look here at the different copy numbers, with the coverage alone it was complicated to tell the copies apart, but with the additional signal of the allelic ratio it is much easier to distinguish them. Same thing for a deletion versus a mosaic loss: you can see it is not the same signal. So when you bring more information, you have a stronger signal and stronger calls.

So theoretically, with NGS data, you are able to call almost every type of structural variant, depending on which method you are using: point mutations, indels, deletions, translocations, everything should be doable. To do that with NGS data we use different strategies. There are four strategies, which I will detail afterwards: read pair, read depth, split read, and assembly.

So, read pair. What is the rationale behind the read-pair approach? With read pair you want to identify the breakpoints, that is, where your structural variant is located, by looking at how reads are aligned. When you sequence a library, you know the shearing that has been done, so you know which insert size is expected in your data, and depending on the type of library, you know the expected orientation of your reads. The read-pair method looks for discordant pairs, pairs with an abnormal insert size or an abnormal orientation, and then looks for clusters of evidence: discordant pairs showing the same signal in the same region.

What does that mean? You have your insert size distribution, and this is why it is important to have a tight distribution around your mean. You want a low standard deviation, because you will say: OK, this is my mean, this is my standard deviation, and everything that is more than some number of standard deviations away, 1, 2, or 3 depending on the accuracy you want, will be called discordant in terms of insert size. Besides insert size, you also have orientation: if the reads are not in the expected directions, the change in orientation also makes the pair discordant.

Now, what does that mean in terms of the signal to detect? With a concordant pair, the insert size is correct and the reads have the expected forward-reverse orientation. Now if I have a variation (here the top is the sample and the bottom is the reference): if I lose part of the reference in my sample, my fragments come from my sample, so they don't contain that piece, and the fragments are of normal size in the sample. But when I map those fragments back to the reference, they will have the correct orientation but a really large apparent insert size, depending on the size of the piece of the reference that was lost. For an insertion it is the opposite: in my sample the fragments are of the correct size, but when I map them back to the reference, the two reads will map really close together. This is why it is important to have a small standard deviation, to have resolution, especially for insertions: if your standard deviation is too large, the shift will drown in the insert size variability and you won't be able to catch the insertion.
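To make the read-pair idea concrete, here is a minimal sketch of a discordant-pair screen, assuming a coordinate-sorted, paired-end FR library in a hypothetical file "sample.bam"; the mean, standard deviation, and cutoff are illustrative numbers, not tuned values. This is a sketch of the principle, not how any particular caller implements it.

```python
import pysam

MEAN_INSERT = 400   # assumed library mean insert size
SD_INSERT = 50      # assumed standard deviation
N_SD = 3            # flag pairs more than 3 SD away from the mean

lo = MEAN_INSERT - N_SD * SD_INSERT
hi = MEAN_INSERT + N_SD * SD_INSERT

bam = pysam.AlignmentFile("sample.bam", "rb")
for read in bam:
    if (not read.is_paired or read.is_unmapped or read.mate_is_unmapped
            or read.is_secondary or read.is_supplementary):
        continue
    # Mates on different chromosomes: translocation-like signal
    if read.reference_name != read.next_reference_name:
        if read.reference_id < read.next_reference_id:  # report each pair once
            print("translocation candidate:", read.query_name)
        continue
    if read.template_length <= 0:   # visit each same-chromosome pair once
        continue
    # FR library: leftmost read forward, its mate reverse
    if read.is_reverse or not read.mate_is_reverse:
        print("inversion/duplication candidate:", read.query_name)
    elif read.template_length > hi:
        print("deletion candidate (insert too large):", read.query_name)
    elif read.template_length < lo:
        print("insertion candidate (insert too small):", read.query_name)
bam.close()
```

A single flagged pair means nothing on its own; as said above, the callers look for clusters of pairs that agree on the same signal in the same region.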
Question from the audience: how would you go about detecting an insertion larger than your insert size? You will not do it with read pair; you will use other approaches.

If you are looking at a tandem duplication, what will you see? With a tandem duplication you have the two copies of the fragment side by side, and the two extremities of the repeated fragment will appear on the reference as pairs with inverted orientation and, depending on the size of the fragment, possibly a change in insert size. For an inversion, you will have signals like this: one read is normal, the other maps in the same direction as its mate but with a really large insert size, and the opposite breakpoint will pop up with the mirror signal, so you will see the two signals together.

Question: why is one of the reads in the opposite direction in this cluster? The pairs that span here will map here with this orientation, the pairs that span there will map with that orientation, and the ones that span the location of the breakpoint itself will end up here. So these three signals arrive at the same position, and this pair will be there, so instead of having the two reads in the expected directions, one will be in the opposite direction.

If you have an insertion coming from a distant locus, you will have the same kind of signal, but the mates could land on another chromosome or really far away on the same chromosome.

So here is a non-exhaustive list of read-pair tools you can use to detect your structural variants; there are many, and new ones appear all the time. Before we move to the next method, a quick overview: this is a summary of different projects where structural variants have been called with a read-pair approach, and what you can see is that there is a kind of bias in which types of events get called. Read pair mostly calls deletions; for the other types it is more complicated. Another comment on read pair: it is good for simple events, but the moment you have complex, nested rearrangements in a region, the signal becomes completely crazy and you cannot make sense of what you see.

So what are the strengths and weaknesses of this approach? The weaknesses: it is difficult to interpret the read-pair signal in complex regions and in repeated regions, because there will be a lot of noise and a lot of events; it is difficult to characterize everything around such regions; and you get a lot of false positives, so a lot of calls. You also have trouble with some event types: insertions, in particular, are really difficult to call with read pair. The strength is that, theoretically, it can call everything. As I said, that is not fully true: for insertions, the moment your insertion approaches your insert size, you will not be able to detect it.
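Before leaving read pair: the screen sketched earlier only flags individual pairs, and single discordant pairs are mostly noise. What the callers actually report are clusters of pairs that agree in type and position. A minimal sketch of that clustering step, assuming the flagged pairs were collected as (chrom, leftmost_position, sv_type) tuples; the window and minimum support are illustrative:

```python
from collections import defaultdict

WINDOW = 500      # pairs within this distance support the same event
MIN_SUPPORT = 4   # require several agreeing pairs before reporting

def cluster_discordant(flags):
    """flags: iterable of (chrom, pos, sv_type) from a discordant-pair screen."""
    by_type = defaultdict(list)
    for chrom, pos, sv_type in flags:
        by_type[(chrom, sv_type)].append(pos)
    calls = []
    for (chrom, sv_type), positions in sorted(by_type.items()):
        positions.sort()
        cluster = [positions[0]]
        for pos in positions[1:]:
            if pos - cluster[-1] <= WINDOW:
                cluster.append(pos)
            else:
                if len(cluster) >= MIN_SUPPORT:
                    calls.append((chrom, cluster[0], cluster[-1], sv_type))
                cluster = [pos]
        if len(cluster) >= MIN_SUPPORT:
            calls.append((chrom, cluster[0], cluster[-1], sv_type))
    return calls
```

This is also where the false positives come from: in repeated regions, many unrelated pairs happen to cluster together, which is exactly the weakness described above.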
The second method is split read. The rationale here depends a bit on what type of sequencing you have; these are tools focused on short-read sequencing. As we were discussing yesterday, if I had to do a project on, for example, bacteria, I would go with longer reads, do an assembly, and call my structural variants with other tools, because I would use a different sequencing approach. So everything in this presentation is based on methods for short-read sequencing. Depending on what you are doing and the specificities of your organism, the sequencing strategy could change, and in that case the tools would change too. But if you use short reads, these methods should be a solution. And as I said, the lists are non-exhaustive: if you have particular constraints, for example an unusual ploidy or low coverage, you could pick specific tools that take that into account.

So, split read. The rationale behind split read: around an event you have the pairs that we use in the read-pair approach, but you also expect a certain quantity of reads that fall directly on the breakpoint, and the split-read approach tries to use this set of reads. A read that falls directly on the breakpoint should show a signal where part of the read maps at one specific location and the other part maps at another location. The disadvantage is that you need reads that fall directly on the breakpoint, so you will have fewer informative reads than with the read-pair signal. And if you can get longer reads, it is better, because you have more chance of observing reads that cover the breakpoint.

So what happens in practice: you align your reads, and most of your pairs align correctly to the reference, but you will have some pairs where either one read aligns and the other doesn't, or one read aligns fully and the other aligns over only a fraction of its length. You focus on these half-mapped reads and try to extract the split-read signal from them.

How does it work? Here on top is the read-pair signal: if you have a deletion, this is your reference, this is your donor, and on the reference you expect to see a larger insert size (or a smaller one for an insertion). In terms of split read, you expect a read that covers the region in your donor to be split into pieces: one part maps on one side of the event, the other part on the other side. Same thing for the insertion, with the same problem as before: if the insertion is too large, it covers the whole read, only one part will map, and you will have to try to make sense of that. For an inversion, part of the read maps here and the other part maps inverted, in the other orientation. And the same logic holds for the other signatures.

As I said, having reads that cover the breakpoint is really good because it gives you the exact breakpoint; the signal is really clear. The thing is that you usually don't have enough such reads. So if you use split read alone, you need to lower the number of events required to generate a call, and you end up with a lot of false positives. So what is done now is that most methods use both read pair and split read at the same time. It is a really good combination: the read-pair signal is used for a first pass over the data, and then the split reads come in to map the breakpoints and confirm what has been seen. It is also good because it allows you to detect small deletions that read pair cannot detect, because they fall below the standard deviation and the precision of that method. And here is a tool list, again non-exhaustive. What you can see is that some tools we saw in the read-pair list are also in the split-read list, like Delly and Lumpy.
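A minimal sketch of the raw split-read signal these tools start from, assuming a hypothetical "sample.bam" produced by an aligner that records the other piece of a split read in the SA tag (as BWA-MEM does) and soft-clips the unaligned part of breakpoint-spanning reads; the minimum clip length is an illustrative cutoff:

```python
import pysam

MIN_CLIP = 20  # ignore tiny clipped ends, which are usually noise

bam = pysam.AlignmentFile("sample.bam", "rb")
for read in bam:
    if read.is_unmapped or read.is_secondary or read.is_supplementary:
        continue
    # Part of the read aligned somewhere else entirely
    if read.has_tag("SA"):
        print(read.query_name, "split alignment:", read.get_tag("SA"))
        continue
    # Soft-clipped end (CIGAR operation 4): possible breakpoint evidence
    cig = read.cigartuples
    if cig and ((cig[0][0] == 4 and cig[0][1] >= MIN_CLIP)
                or (cig[-1][0] == 4 and cig[-1][1] >= MIN_CLIP)):
        print(read.query_name, "soft-clipped at",
              read.reference_name, read.reference_start)
bam.close()
```

The clip position gives the breakpoint to the base pair, which is exactly the resolution gain over read pair discussed above.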
Delly is the tool we will use during the practical, and it is there precisely because these tools use both approaches together. Question: if you were looking to reinforce the split-read signal, what would be the best thing to have? Longer reads. With longer reads you have more reads covering the same breakpoint, and you also have more chance of catching larger events, for example larger insertions.

So what are the weaknesses and the strengths of this method? The strengths: it works really well in combination with the read-pair method; it gives you the base-pair resolution of your breakpoints that you cannot get with read pair alone; and it can detect very short events. The weaknesses: it needs more coverage, and it produces a lot of false positives if used alone.

The next method is read depth. Read depth is mainly used to detect copy number variation, and it is a really, really good approach for that, but it has one main assumption: it assumes a homogeneous distribution of your reads, that is, a homogeneous coverage all along your genome. We discussed that with GC content: this is not true. Same with mappability: this is not true. So that is the main issue we have to face.

How does it work? The main read-depth approach is: you divide your genome into bins of equal size, you estimate the depth of coverage in each bin, and then you look for clusters of consecutive bins that show a deviation in coverage. The principle is really easy and simple; the main issue is that main assumption. Theoretically, if you have a deletion, you will see a run of bins where the coverage drops, because you have no DNA from that region. If you have a duplication, all the copies of the region will map back to the same place in the reference, and you will see a bump in coverage. So it is really simple. And what is interesting is that this strategy is in a way the child of what had been developed for CGH arrays, because CGH works on the same principle: you count the amount of DNA you see per region. So all the methodology, the statistical approaches, had already been developed.

This is an example of what you expect to see: if you look at the coverage, you clearly see a big signal. When it works well, in a germline sample, it works super well. In tumors, which is where it matters a lot, we start to have issues. This is an example of what we see in a tumor, a visualization from the tool we developed at our center, called SCONE, where you have your normal coverage, your tumor coverage, and the ratio of the two, and you are able to detect which regions carry a somatic copy number change. When the signal is there and there is not too much noise, it is super easy, depending on the resolution; when you start to go to small bin sizes, the noise and the local variation of your coverage have a higher impact.

So the main issue with this method is the normalization. As we said, the main assumption, the expectation, is that all the bins have approximately the same coverage, and when I have, for example, a duplication, I will see it. If the signal looks like what we expect, it is super easy to detect, as we see here.
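A minimal sketch of that plain binning idea on one chromosome, assuming an indexed, hypothetical "sample.bam"; the bin size and ratio cutoffs are illustrative, and a real tool would normalize for GC content and mappability, and segment runs of consecutive bins, before calling anything:

```python
import statistics
import pysam

BIN = 10_000  # 10 kb bins
CHROM = "chr1"

bam = pysam.AlignmentFile("sample.bam", "rb")
length = bam.get_reference_length(CHROM)
counts = [bam.count(CHROM, start, min(start + BIN, length))
          for start in range(0, length, BIN)]
bam.close()

med = statistics.median(counts)
for i, c in enumerate(counts):
    ratio = c / med if med else 0.0
    if ratio < 0.6:       # roughly what a one-copy loss would look like
        print(f"{CHROM}:{i * BIN}-{i * BIN + BIN} possible loss ({ratio:.2f})")
    elif ratio > 1.4:     # roughly a one-copy gain
        print(f"{CHROM}:{i * BIN}-{i * BIN + BIN} possible gain ({ratio:.2f})")
```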
Now, in reality, is that what we observe? The signal is really noisy, and we lose that illusion: knowing that this bin is different from that one is super difficult. Why is it super difficult? Because almost all the tools, including the one I developed, the one we use here, work the same way: we normalize horizontally, against all the bins around you.

So there are newer methods, and there is one I really recommend, if you have enough samples, called PopSV. It is a copy-number-variation method where the idea is to work in a population. If you have enough samples, instead of normalizing along the genome, you imagine your real sample stacked together with the other real samples, and you normalize vertically: you take the bin at this position of the genome and you look, in every sample, at the normal variation of coverage in that bin. If you have a deletion in one specific sample, that sample should differ from the others in that bin. So whatever your coverage is elsewhere, and however your other bins are affected by local variation, you don't care, because you only measure that bin and the variation within that bin. And I have to say: I developed the tool called SCONE, a student developed PopSV, and if you have enough samples, it outperforms whatever other method you use.

Question: how many samples do you need for that? That is the question I always asked him, and it depends on the coverage. Empirically, 30 is the safe number: if you have 30 samples, it will always work. If you have really good samples, really good quality, 10 or 15 can be enough, but to be safe we always say 30.

So it works like this: for each bin, you estimate the local variation in the population, and you compare the variation of your individual sample to the population. If you have an event, your sample will deviate from what is expected in the population in that bin. And it performs really well in low-mappability regions and at small bin sizes, provided you have enough samples. So it is great.

This is the list of tools; there are many, many tools you can use, and all the CGH-array tools can be applied as well. So what are the weaknesses and strengths of this approach? The strengths: fast, simple, easy to interpret; you get plots that show you the deletion directly. It determines the copy number, so it can give you a genotype. And with the population approach, you have power even with low coverage or really low resolution. The weaknesses: if you don't use a population, the resolution starts to be a bit coarse, between 5 and 10 kb for the smallest events, and sometimes you want to go lower. The breakpoints are ambiguous, because they depend on the size of your bins; some methods now use sliding bins to try to get a better breakpoint resolution, but it is still complicated, so once you have found your events, going back with other methods to refine the real breakpoints is better. And you cannot find balanced rearrangements, the events that preserve the copy number.
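To make the vertical normalization concrete, a toy sketch in the spirit of PopSV (not its actual implementation), with simulated data standing in for a samples-by-bins coverage matrix; the z-score cutoff is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
coverage = rng.poisson(100, size=(30, 1000)).astype(float)  # 30 samples x 1000 bins
coverage[0, 500:505] *= 0.5  # simulate a one-copy deletion in the test sample

# Horizontal step: scale every sample to the same median, so library size drops out
coverage /= np.median(coverage, axis=1, keepdims=True)

# Vertical step: compare the test sample (row 0) to the population, bin by bin
ref = coverage[1:]
z = (coverage[0] - ref.mean(axis=0)) / ref.std(axis=0)

for i in np.flatnonzero(np.abs(z) > 4):
    print(f"bin {i}: z = {z[i]:.1f}")
```

Because each bin is only compared to itself across samples, GC and mappability biases that hit the same bin in every sample cancel out, which is why the approach holds up in low-mappability regions.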
Question: it applies to DNA only, right? You cannot use it for RNA? We do use it for RNA, but the depth signal still traces back to the DNA. What we do is change the resolution: when we apply it to RNA, we take really large bins, so large that the individual expression of one gene cannot affect the total number of counts in the bin. So it works, but to make it work you need bins that aggregate enough genes that one gene's expression will not bias the full bin. And we are able to do that in single cells, single-cell RNA; it works really well.

The last approach is the assembly approach. The rationale of assembly is this: you say, OK, one of the main problems of all the SV methods is that I am using a reference, and I need to go from the reference back to my sample, and that gives me problems with mapping, quality, coverage distribution, all these technicalities that make things difficult. So the idea is: forget about the reference. Take my reads, do a de novo assembly, generate contigs, long stretches of sequence, and once I have generated my contigs, I can align those contigs back to the reference and have a much better view of the structural variants. Why does it work? Because when you do assembly, except for some regions where assembly is complicated, your contigs will be much larger than your reads, so your resolution is better and everything is easier for detecting structural variants. The contigs are built from the sequence shared between different reads, and there are several ways to do it; it really works well.

There are two different approaches within this rationale: whole-genome assembly and local assembly. Whole-genome assembly should in principle be the best one, but as we already said, every choice has a cost: whole-genome assembly costs a lot of resources and a lot of time. If you have the time and the compute resources, you can go with it, but it is a really resource-intensive task. Whole-genome assembly is exactly what it sounds like: you do a de novo assembly of your whole genome, then you align your contigs, group them into scaffolds, and look at how they land on your genome. When you see a scaffold whose pieces align to different or separated regions, you probably have an insertion, a deletion, or another event. It really works well. It is a lot of work to do the assembly and a lot of work to make sense of the data, but I think that if you have the resources and the time, it is the best way to detect complex structural variants.

Local assembly tries to reduce the time and resources needed. Around a structural variant you will have specific reads: either reads that do not map to the genome at all because of the structural variant, or what we call one-end-anchored reads, pairs where one read maps to the genome and the other does not. So these methods take all these reads, do local assemblies of the regions where there are unmapped and one-end-anchored reads, generate contigs, and map the contigs back to the genome. The one-end-anchored reads are used as anchors to know where the structural variant is located. In my view it works well, but not as well as a full de novo assembly. And here you can see the different signatures you get with local assembly.
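The assembly itself is a job for a real assembler, but the read-gathering step that local methods start from is easy to sketch; a minimal version, assuming an indexed, hypothetical "sample.bam" and a hypothetical candidate locus:

```python
import pysam

REGION = ("chr1", 1_000_000, 1_010_000)  # hypothetical candidate locus

bam = pysam.AlignmentFile("sample.bam", "rb")
reads_for_assembly = []
for read in bam.fetch(*REGION):
    if read.is_unmapped or read.is_secondary or read.is_supplementary:
        continue
    cig = read.cigartuples
    clipped = cig is not None and 4 in (cig[0][0], cig[-1][0])   # soft-clipped end
    one_end_anchored = read.is_paired and read.mate_is_unmapped  # anchor read
    if clipped or one_end_anchored:
        reads_for_assembly.append((read.query_name, read.query_sequence))
bam.close()

print(len(reads_for_assembly), "reads collected for local assembly")
```

The anchored mates tell you where on the genome the assembled contig belongs, which is the anchoring role described above.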
So what is cool is that assembly is able to detect almost every type of SV. The weakness of this type of approach is duplications: duplications are sometimes hard, because the repeated sequence at the borders of the region is exactly what makes the assembly algorithms fail. What are the tools? All the assemblers can be used: Cortex, SGA, ABySS, SOAPdenovo, whatever. And there is a tool that does the local approach, called SvABA, that we use for local-assembly detection of structural variants. The weaknesses: very, very intensive in terms of resources, and some regions of the genome remain hard to resolve. The strengths: really good resolution of your breakpoints, and almost all types of variation can be detected.

Just to give you a summary of the different methods, going from depth of coverage to assembly, you can see on this slide the resolution you can get with each method. You can see that one method may give you lower resolution but be really easy and fast to run. So again, as I said, it is a choice: I can take this method and get lower resolution, but it will be easy and fast; or I want high resolution, and it will be complicated to run and will cost me more resources, more time, and so on.

In summary, we have seen four different methods, each with its strengths and weaknesses. As I said, most recent tools use a combination of different methods; Delly and Lumpy were the first ones that did that, and now other tools do it too. During the practical, we will use Delly to do the analysis.

The major challenge in structural variant detection is dealing with the complex events we saw in those complex regions, where the tools give you a lot of noisy signals; and then finding the real breakpoints. Those are the two computational challenges. The third challenge, which is more a lab challenge, is validation. As with any variant calling, a call is a prediction, and we want to validate it. Validating large events is quite complicated: you can take long reads, or you can try long-range PCR, but it does not work super well. So validation of SVs is complicated, and that is why the field is still evolving so much, and why new methods keep appearing: no one has been able to generate a real, validated truth set of structural variants that we can use to benchmark the methods. That is why everybody who proposes a method runs simulations and so on, but there is no single truth set where we know all the structural variants, run the tools, and know which calls are false positives. What we have for SNPs is not yet possible for structural variants.

Just a quick slide on visualization; you already saw some of this yesterday. For a deletion, you expect reads that map with a larger insert size around the deletion. For a tandem duplication, you expect reads in the two directions, mapping like that. For an inversion, you get your two sets of reads, there and there, in opposite directions. And an insertion relative to the reference looks like this. And just to show you: when people try to display structural variants now, this is the kind of figure they move to. I don't know if you have already seen this kind of graph; it is a circular representation, called a Circos plot, which shows on the outer tracks the locations of your variants and, here in the middle, the locations of the translocations. That is how people show this. And that's it.
I think it's lunch time. No, not lunch time. Any questions?

So, the Circos plot is a circularization of the genome. You start here: you have chromosome 1, 2, 3, 4, and so on, up to X and Y here. So this is a circular representation of your genome. Then you have tracks that show, for example, that at this position I have an event; you plot your events along the tracks, and it could be whatever type of event, also point mutations, any local event. And in the middle, links are the way to plot translocations, because it is complicated to draw a link between two chromosomes otherwise. So that is how structural variants are usually plotted: the circular genome, different circular tracks that show local events, and in the middle the links that show translocations between regions.

Question: how many translocations are there? I don't know if this one is a Circos plot of real genomic data; but if you want, at the break, I can show you a Circos plot of a real event. And if you want to make Circos plots, you can use Circos, the original software. It is not super complicated, but it is complicated to work with: you need a good command of formats and data manipulation, because the format is its own homemade format. If you want to do it more easily, there are packages in R you can use, if you are familiar with R, the statistical language: the package circlize does the job really well.

Question: what if you have only exome data? It is more complicated. Some methods, some CNV methods, and especially the PopSV population approach, are able to call CNVs in exomes, because you look at each bin; it works the same as long as everyone has the same capture kit and the same baits. But the problem with exomes is that your regions are captured by baits, the number of baits is not uniform, and that creates artifacts in the signal, both for the pairs and for the depth. So all these methods do not apply really well to exome data.