starting from the beginning. We're now going to be looking at not just calling single nucleotide variants and small indels, but looking for larger structural variants. So the objective of this module is really: what do I mean by structural variants? How can you discover structural variants from next generation sequencing data? There are lots of different ways, and a little bit like I started saying with indels, it's harder to detect structural variants than single nucleotide variants, and I'll explain the differences and the weaknesses of the different methods. What patterns do you expect? I guess you guys already started in the IGV session yesterday, looking at deletions and the patterns you expect, but we'll go over that in more detail now. Then we'll detect them using the program Delly, and look at some of the calls in more detail. So what do we mean by structural variants? There are lots of definitions, but roughly it's a change that's bigger than about 50 base pairs. So you can think of single base pair changes, then small indels, then structural variants: by structural variants we typically mean something a little bit bigger, slightly bigger chunks of the genome that are changed. It can be a deletion, an insertion, an inversion; you've got mobile element transpositions, duplications of whole sections, translocations, and so on. All of these bigger changes are what we mean by structural variants. It's been known for many years that these types of structural variants occur. As soon as we had technologies that could look at chromosomes, for instance, we could detect things like translocations using FISH. But all of these are high-level, low-resolution methods. Now that we have all of this detailed sequencing data, we should be able to extract from all of these reads very high resolution information about these changes.
And exactly where on the chromosome is that translocation? So there are all of these larger changes happening to the genome, and really the objective of this module is: how can we extract that from the read information? This is particularly relevant in cancer. So this is an example of a tumor compared to a normal. With chromosome painting, each chromosome in the normal genome is painted a single color. The fact that the tumor chromosomes are all jumbled up is because of all of these larger rearrangements: translocations, chromosome duplications, and so on. So if we sequence, and all we have are bits and pieces of the genome, how can we actually extract back this information? So, the different classes of structural variants. You've got copy number variants: these are the deletions and the duplications, where you're changing the number of copies of a particular section of the genome. You have other rearrangements that are copy neutral: inversions don't change the content but change the order, same with translocations. And then what we can call other structural variants: novel insertions of sequence, transposons that are jumping around, and so on. Here now is a more visual representation of these. So here you have a deletion: the genome you've sequenced has lost a copy of a piece that's in the reference. Here, a novel insertion relative to the reference: there's a new piece in the genome that you've sequenced. A mobile element insertion is similar — it's an insertion, but the trick is that if you're talking about Alus and elements like that, that particular piece of DNA is also found in lots of other regions of the genome, so this can be challenging to detect. In a tandem duplication, the same region gets copied right next to itself; an interspersed duplication is also a duplication, but the two segments are not next to each other.
Inversion, as I mentioned: a whole section of the genome gets flipped. Translocation: two chromosomes exchanging material, and so on. A lot of our ability to detect these structural variants has followed the technology. It started with karyotyping, just looking at very gross abnormalities of chromosomes, then CGH and FISH, the technologies I was talking about. These were slightly more precise, but you still didn't really know which genes were affected in detail, and we were missing small events. In the early 2000s, microarrays were used, especially arrayCGH, for much more precise profiling of copy number changes in particular, and now with sequencing you would think that we can detect everything, but as you'll see, the challenge is really in the informatics analysis of these data. So arrayCGH and SNP arrays can both be used to detect copy number. With arrayCGH, you have probes throughout the genome and an expected level of intensity for each: if you see higher intensity at a probe, you can detect that there's a gain; lower intensity, one copy or zero copies. And because you have a tiling of probes throughout the genome, you're able to detect these copy number changes. One limitation is that you won't be able to detect an inversion or a translocation using this type of approach. You can use the very popular SNP arrays as well. These were designed for genotyping initially — the SNPs on the array are representative of the SNPs in the genome, chosen after projects like the HapMap project that looked at which variants are present in different populations — but you can also use the intensity levels in these arrays, at some level, to detect copy number.
So it is possible to detect copy number variants using arrays, but with next generation sequencing, in theory, we can detect those copy number changes but also other types of events, much more precisely. What we were looking at in the first module was more these point mutations — you should be familiar with this, it's very much like IGV: you've got this paired-end sequencing and you're looking for these types of changes. In some cases we could detect small indels as well, but now what we're trying to do in this lab is move into this other part of the spectrum. If there are no reads in a particular region of the genome, it's likely a deletion. If you have more reads than you expect, you should be able to call a gain. We should be able to look for these weird pairs that maybe point to a translocation breakpoint, and so on. So all the information is in the reads, and the question is: can we extract it from the reads? One challenge is that, if you remember, the first step is to map all the reads. Before, they were all mapped correctly to one location and then we looked for changes. The problem with structural variants is that you get situations where the reads are not really mapping where they should be, and all sorts of other things. So we're going to be using other properties of the reads. The strategies to call structural variants from reads all use the read information in a different way. We can use the read pair information — I'll get into each of these separately — using the way the two ends of a pair map as a way of detecting structural variants. We can use the read depth, which is what I was doing by eye, looking for places where there are more reads or fewer reads. We can use the fact that sometimes reads are going to be split over the breakpoint.
So this is, again, looking at the read itself and how it's mapping, and seeing patterns where it's getting cut. Or you can do de novo assembly — this is like the session you did with Jared last night, where you're starting from scratch; all of the other approaches are using the reference as a guide. So we'll go over these quickly, just to give you a sense of how the methods work. Using the read pair information first. Especially if you're interested in structural variants, typically you'll do paired-end reads: you have a fragment and you're reading the beginning and the end of the fragment. And typically, when you prepare your sample, you prepare it such that the fragments have a relatively tight size distribution — this is especially important if you're doing structural variants. So this is a library where the target was to have fragments of 10 kb, and we read the beginning and the end of each. Most fragments then follow this distribution — they're all around 10 kb — but you have pairs where, once you map them, the beginning and the end are way too far apart, or way too close. And we're going to be using that information to identify regions that have had rearrangements, basically. The key here is that your actual DNA fragments were all 10 kb, so as soon as you map a pair and it's not 10 kb, it's not concordant, and you can use that information to say something is going on in the genome at that position. So this is what you expect, on the left: you have the reference genome, you have the genome that you're sequencing, and there's an expectation of how far apart the two ends are. If, when you map on the reference genome, they're too far apart, this means there's probably a piece that's missing.
So you can associate that with a deletion: when you map on the reference, the pair is 20 kb apart, but almost none of your fragments were 20 kb based on the way you prepared them. So if you see a fragment that looks like it's 20 kb, it's probably because you have a 10 kb deletion. You have the reverse too: if, once you map them on the reference, they're very close together, maybe it's because there's an insertion in the genome that you've sequenced. So the read pair strategies use information about how the pairs map to detect these changes. Similarly — that was just using the distance — you can also use the orientation. There's a specific expected pattern of orientation when you read the two ends of a fragment. If you map on the reference and you see a pattern where one end is here and the other end is over there in the wrong orientation, again, you can go back and figure out what the underlying genome was. A pattern like this, when you map on the reference genome, is probably an indication of a tandem duplication. An inversion is going to have its own little pattern. An inversion is interesting in the sense that, if it's a clean inversion, the two breakpoints will give you complementary evidence of something that's weird. So this read maps here and this one maps here — that makes no sense, you expected them to be oriented like this, and the same with the other pair. So it's like a mini puzzle: you can turn it back and figure out what was going on. And this one is an insertion, maybe from a different region, or of a repetitive element. So all of them have very distinctive patterns in terms of how the pairs map. The trick is how you actually implement an algorithm that looks for these patterns and predicts the structural variants.
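The insert-size part of this idea can be sketched in a few lines. This is a toy illustration, not code from Delly or any real caller — `classify_pairs` and its 3-standard-deviation cutoff are hypothetical, and real tools estimate the expected distribution from concordant pairs only:

```python
import statistics

def classify_pairs(insert_sizes, n_sd=3):
    """Flag read pairs whose mapped insert size deviates from the
    library distribution: too far apart suggests a deletion, too
    close together suggests an insertion.

    insert_sizes: absolute reference-mapped distances between the
    two ends of each pair (a simplified, hypothetical input).
    """
    mean = statistics.mean(insert_sizes)
    sd = statistics.stdev(insert_sizes)
    calls = []
    for size in insert_sizes:
        if size > mean + n_sd * sd:
            calls.append("deletion?")    # ends map too far apart
        elif size < mean - n_sd * sd:
            calls.append("insertion?")   # ends map too close
        else:
            calls.append("concordant")
    return calls

# A 10 kb library with one pair mapping 20 kb apart (deletion
# evidence) and one mapping 500 bp apart (insertion evidence):
calls = classify_pairs([10_000] * 20 + [20_000, 500])
```

A real caller would then cluster nearby discordant pairs and only report an SV supported by several of them, exactly as described above.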
So there are a number of tools that do that: BreakDancer, Delly, Lumpy. The one we're going to be using is Delly. But all of them try to do the same thing: they look for this evidence of discordantly mapping pairs, and they look for multiple instances, because just like with single nucleotide variants, if you just see one weird read, you don't care; but if you see multiple read pairs saying the same thing, that's probably a good sign. This is maybe not super relevant, but this is one of the early whole genome sequencing studies that I participated in. Here we were doing this semi-manually, before these algorithms were out there, trying to interpret the read pairs to detect deletions, duplications and so on. At that time — this is five, six years ago — it was still expensive to do complete whole genome sequencing, so we had this paired-end strategy for detecting events. One thing that's interesting in this table is that we had two normal samples and then all of these cancer samples, and one thing that should jump out as a bit weird is that the normal samples look almost as bad as the cancer samples. The reason is that there are a lot of germline structural variants as well. So when you're sequencing a tumor, you also need to sequence a normal sample to know which changes are somatic versus germline. I wanted to show you this to illustrate the complexity of these types of data sets. It's hard to see with this light, but on the top you have arrayCGH, which is just giving you an indication of copy number.
Going along the genome, there are chromosomes — chromosome 1, 3, 17 — that clearly have these very big amplifications of certain regions, because they have multiple copies. With whole genome sequencing we get the same thing, in a way: the coverage shows you that there are clearly some regions of the genome with multiple copies, so you can definitely see that. But what you have on top of that is all of this paired-end information, which gives you some idea of what's connected to what. So what the graph down here shows is, for instance: clearly there's an amplification of this section and an amplification here, but this says that we had a thousand pairs linking this DNA to that DNA. So we know more than just that there are lots and lots of copies — we know that this bit, in the rearranged genome, now seems to be next to that bit, as a tandem duplication of that little section. So we know the specific breakpoints and we know what is next to what. So we know, maybe, that there's a fusion gene, or that there's a new promoter in front of that gene. It's useful information, but it's also a bit messy and hard to interpret exactly how the genome is arranged and what happened. Anyway, in terms of a summary of these approaches — some of these weaknesses I didn't discuss before. It's challenging when you've got repetitive regions, because those will lead to weird pairs simply because you don't know where to put the reads. So it's harder to interpret structural variants when the regions are not unique, because then you have mapping issues on top of that. And when the genome is highly rearranged, it's hard to untangle and really know what's going on.
So all of the approaches and tools that I talked about tend to have high rates of false positives — again, repetitive regions might lead to false positives, and highly rearranged regions are difficult. But in theory, from these approaches, you can detect almost anything, so it's still a valuable strategy; it will just have a lot of false positives too. Another, complementary strategy is to use the read depth information. We've done this a little bit manually in IGV: if you look at the coverage, it jumps out that there are more reads in this region than in the rest of the genome, so it's likely that you have a duplication. This looks visually very easy; in practice it's also a little bit challenging, because other things affect the read density in different regions of the genome. There are a number of tools that use this approach. Many of the methods that work on arrayCGH data also work on read depth — you just need to adapt them a little bit to take into account the variability in sequencing coverage. So typically you bin the genome and then count reads per bin. It's a complementary approach that works relatively well. This is another plug for some of the things that we do — maybe I'll skip this slide — but the approach that a student in my group has developed addresses one challenge: the coverage tends to vary across the genome quite a bit. This shows multiple normal samples, and the coverage across the genome varies quite a bit, in a reproducible way. We use that as a reference, and then when we're interested in one sample, we look at the coverage in that specific sample, and it becomes very easy to see where it's an outlier. But again, this is just one of the approaches.
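The bin-and-count idea can be sketched like this — a deliberately naive illustration (the function names and the fixed `expected_per_bin` depth are made up; real tools normalize for GC content, mappability, and so on before calling copy number):

```python
from collections import Counter

def depth_bins(read_starts, bin_size=1000):
    """Count how many reads start in each fixed-size bin."""
    return Counter(pos // bin_size for pos in read_starts)

def call_copy_number(bin_counts, expected_per_bin):
    """Naive per-bin copy number: scale observed depth by the depth
    expected for a normal diploid (two-copy) region."""
    return {b: round(2 * c / expected_per_bin)
            for b, c in sorted(bin_counts.items())}

# 30 reads starting in bin 0 (normal diploid depth) and 60 in
# bin 1 (twice the expected depth, i.e. a duplication):
starts = [5] * 30 + [1500] * 60
cn = call_copy_number(depth_bins(starts), expected_per_bin=30)
```

With these inputs, bin 0 comes out as two copies and bin 1 as four — the kind of per-bin estimate that read-depth callers then segment into larger events.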
There are lots and lots of tools where you apply read depth to try to detect copy number. And again, it's not a trivial problem, and especially in regions that are repetitive, there's still a relatively high rate of false positives. As a summary: we won't be using any of these tools in the practical, but of course you can try them, and you can visualize the calls in IGV as we've done for the small changes. For copy number, I'm calling read depth approaches relatively low resolution, because you have to use bins of a certain size: you don't get the exact breakpoint, because you're just binning the data and finding places where the signal goes up and down. And just like the array-based approaches, with read depth you cannot see an inversion or other balanced rearrangements. A strength is that you're actually estimating the number of copies of that part of the genome, and in some cases that's useful: if a gene is duplicated, are there two copies, four copies, five copies? And if you don't have a lot of coverage, if you make the bins big enough you might still see that a whole chromosome is gone because there are no reads on it — so you can adjust the resolution. So that was this type of approach. The other type of approach, which I won't go into in much detail, is the split read approaches. This is a bit similar to the read pairs, but here you're really looking at one of the reads itself overlapping the breakpoint, giving an indication that there's something funny because half of the read maps here and the other half maps elsewhere. Before, I was saying that one end of the pair maps here and the other end maps over there; now it's the read itself that partially maps at this location and then breaks off. And so you've got approaches that specifically target these split reads.
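Split-read evidence usually shows up as soft-clipped alignments, where part of a read did not align at its mapped position. Here is a small sketch that parses a CIGAR string for soft clips — the functions and the 20% cutoff are illustrative assumptions, not what any particular caller uses:

```python
import re

# CIGAR string: runs of <length><operation>, per the SAM format
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def clipped_fraction(cigar):
    """Fraction of the read's bases that were soft-clipped (S),
    i.e. present in the read but not aligned at this position."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Operations that consume read bases: M, I, S, =, X
    read_len = sum(n for n, op in ops if op in "MIS=X")
    clipped = sum(n for n, op in ops if op == "S")
    return clipped / read_len if read_len else 0.0

def looks_split(cigar, min_frac=0.2):
    """Candidate split read: a large clipped tail that a caller
    would try to remap on the other side of a breakpoint."""
    return clipped_fraction(cigar) >= min_frac
```

For example, `"50M50S"` — half the read aligned, half clipped — would be flagged as a candidate, while a fully aligned `"100M"` read would not.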
And actually Delly, which we're going to use, combines the read pair and the split read signals to identify these structural variants. It uses reads where one part maps and the other part doesn't. But again, you can imagine that you have to look at how the reads map in the file — these are all reads that map in a weird way — so you have to scan the files for these pieces of evidence. You need sufficient coverage for these methods to work, because you need a read to actually overlap the breakpoint, and you have the same problems in repetitive regions. You can combine split reads with read pair methods, and that's really what Delly does. But this signal is really great because it gives you the specific base: you have a read that maps to the genome and then suddenly it doesn't map anymore, it maps somewhere else — so you really get base pair resolution. The last section, which I won't discuss much, is assembly: you assemble all of your reads and then you compare to the reference after that. This is using approaches like Jared talked about last night: you assemble your genome — your tumor — from all of the reads, and then you directly compare the contigs that you've generated with the reference and see if there are differences. There are de novo assembly tools for this, except that, as Jared said yesterday, assembling something like the human genome is still quite challenging with short reads. That's one of the challenges when trying to predict SVs this way, but you can definitely try: you assemble from scratch and then look at differences between the genome you've assembled and the reference. The weaknesses: it's computationally quite intensive, and it's hard to resolve repetitive and complex regions. But in theory, that's the ultimate way: if we had longer reads and a good way of assembling the genome, this would solve the problem.
We'd just assemble the genome that we sequenced and compare it to the reference. So, in summary — and I'm coming to the end of this intro, and we'll move back to the practical — you've got a whole range of approaches. Starting from approaches that use depth of coverage, which have low resolution but are quite easy: you're just binning and looking. You have the paired-end approaches, which start to have better resolution. You have the split read approaches, which really point at the breakpoint, but now you have to sequence quite a bit to get enough split reads to detect events. And then there's de novo assembly, which is in theory high resolution but very difficult — more difficult than costly, in this case; it's just difficult because assembling any genome is difficult, and you don't know the difference between a problem in the assembly and a real variant. So, do you know any programs that can give you a list of variants if you're using de novo assembly? That's a good question. I guess any program that can pull out differences between assemblies would be good, right? There are lots of programs for de novo assembly that help you assess and compare different assemblies, to know which one is better, and to zoom in. Personally, I have not used de novo assembly to call structural variants in human, because it's quite hard. But at the end you're getting contigs out of your assembly, right? And then you can align those to the reference and see which ones map perfectly and which ones don't, and pull those out. Any tool that helps you compare assemblies in general would help in a way, but I'm not so familiar with tools that specifically extract structural variants from assemblies. There must be some, but I haven't used any of them. Yeah — for fungal organisms, the genomes are not very big. Yeah. And it's not that costly to do that. Yeah.
And you're saying it's a question of resolution — yeah, absolutely, and cost. Because one problem is, for instance, if you have an insertion of something that's not in your reference genome, then if you're only doing things that are reference-based, that information gets lost: those reads don't get mapped, so they don't really show up. Maybe this is a bit too human-centric, the way I'm presenting it — there's not much of that in human; most of the structural variants are of the other types, so we're not missing much by not doing assembly, and we can check that somewhat independently: do you have reads that don't map, and things like that? But for sure, if you can do a de novo assembly, it's actually quite good for detecting structural variants, and then you're comparing the samples directly. Yes — as for the inversions, the inversions are not so bad, because the reads map properly; it's just that they don't have the right orientation, right? So usually those map okay, and you just need a program like Delly to scan the mapped reads and extract them. In theory those are not so bad. In practice, in human they're not so common: there are many fewer inversions than deletions and duplications, for sure. So they're less common, but in theory you can extract them. And they have the advantage that a clean inversion has two breakpoints, so you should have evidence on both sides, and you can really have confidence that it's real — in my slides I have an example of it. If you really see that pattern, it's pretty clear that it's real, okay?
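The orientation logic behind these patterns can be written down as a tiny lookup. This sketch assumes a standard Illumina forward-reverse (FR) paired-end library; the function name and labels are made up for illustration:

```python
def pair_signature(left_strand, right_strand):
    """Classify one mapped read pair by strand, taking the
    leftmost-mapped read first ('+' forward, '-' reverse).

    For an FR library, +/- is the expected concordant orientation;
    both ends on the same strand suggest an inversion; the everted
    -/+ orientation suggests a tandem duplication."""
    if left_strand == right_strand:
        return "inversion?"
    if (left_strand, right_strand) == ("+", "-"):
        return "concordant"
    return "tandem duplication?"
```

And as mentioned above, a real inversion produces both same-strand patterns — +/+ pairs at one breakpoint and -/- pairs at the other — which is why the two breakpoints corroborate each other.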
Yeah, so just before I finish, a few examples of what this looks like — again, this is in human. Typically what people do is actually apply many methods and combine the results of all of them, because they all have false positives and they all have false negatives. So typically people try many methods and validate. If you look at the 1000 Genomes Project, or this recent structural variant sequencing project in the Netherlands, they really apply all sorts of tools — different tools are good for different types of events. Then they look at the variants that are called by many tools, and either they focus on things called by all of them — but then they're probably missing a lot — or they go in and validate. It's quite hard and tricky to detect structural variants in general, compared to the single nucleotide variants. Okay, so in terms of what you hope to see: things like this. We've seen some examples when we were looking at IGV. This is the type of profile we hope to see for a deletion. This would be a homozygous deletion, because you have no reads at all in this intron, and all the flanking reads are colored here because they don't have the expected insert size, right? They're all too far apart on the reference: this end was probably next to this one in the sequenced genome. So we'll see examples like this; this is the signature of a homozygous deletion. This would be the signature of a duplication: notice that you have more reads in this region compared to the flanks, and at the boundary of the duplication you have pairs of reads with the wrong orientation, with this end probably next to that one. On the reference, it makes no sense; that's because in the genome you've sequenced, these are next to each other. Here's an example of the inversion I was talking about.
So an inversion leads to this type of pattern: you have roughly even coverage — this is not a copy number gain — but you have lots of pairs with one end here and the other end over there, and the reverse, with pairs connecting from here to here. So you've got a breakpoint here and another one somewhere here, and the whole segment is flipped, such that this end is indeed next to that one and this end is indeed next to this one. So this looks like a real inversion. Here's something that would be a LINE insertion: within the repeat, mapping is difficult, but all of these reads don't have the expected insert length. In this case it's an insertion relative to the reference. So we'll go through the exercise — we're going to focus on deletions in the practical, but you'll see examples like this. Another cool thing with SVs is that they give you an opportunity to make nice plots — they look nicer than the SNVs. These plots show all chromosomes, from one around to here, and these lines are all insertions or translocations. These are big events, so you can represent them in a different way too. But I think with that, I'll take more questions if you have any, and then we'll go back through the practical, run Delly, and try to find some different types of events. Well, the most common are really the deletions and the duplications. Even between individuals, there's more of the genome that's different because of these structural variants than because of SNPs, right? So we have quite a lot. In the practical we're going to be in just one portion of the genome, and there are lots of deletions. For most of the genome we don't really know what it does, and we have tons of deletions. It's one of the reasons why I don't want to sequence my own genome: I don't want to see where I have a deletion and don't have some gene.
It's like, what does that mean? I don't want to know, right? Because we have tons of deletions. I forget how much of the genome differs because of structural variants, but I think it's about 10 times more than for SNPs, in terms of bases — because structural variants are bigger, they're less frequent. You don't have 10 million structural variants the way you have millions of SNPs; I forget the exact number, but I think you typically have a few thousand structural variants between two individuals. They affect large chunks of the genome, though, so in terms of total bases it's quite a bit. And in the complex regions there are even more rearrangements. We're probably underestimating the actual number, because in the complex regions full of repeats we don't really know what's going on from short read data; we need longer reads to sort that out. So these numbers are probably underestimates, looking only at the well-behaved regions, and we already know that there are roughly 10 times more bases affected by structural variants — deletions, duplications — than by SNPs.