Okay, so this morning was the easy stuff; we're moving to the real stuff now. We're going to be talking about structural variant calling.

Maybe, given some of the questions I got during the break, I'll first make a couple of comments on this morning. Most of what I talked about assumed we were doing variant calling on a normal human genome, but I'm sure many of you are thinking of doing variant calling in other contexts: other organisms, mixed populations, and so on. Many of the steps are the same, but the variant calling itself will be different, because the expectation that an allele is supported by 0%, 50%, or 100% of the reads no longer applies if you're sequencing a mixture or a population. That's something to keep in mind. You can still do variant calling in those contexts, but you have to tell the variant caller that you're not expecting a diploid genome, which is basically a matter of the parameters we were using in the lab earlier. So it can be done; it's just that the variant calling step itself would be slightly different.

So, we talked about small variant calling: SNPs and indels, but very small indels. Now we're moving to structural variant calling, which targets larger events. Again, most of what I'll say uses the human genome as a reference, although the same principles apply with a different reference. The learning objectives for this module are to understand what structural variants are, to appreciate how you can discover structural variants from next-generation sequencing data, and to appreciate the strengths and weaknesses of the different approaches. You'll see that, in the case of the first module,
I went over a pretty well-established pipeline, the GATK framework. For structural variants, you'll see that it's still very much work in progress: there are lots of different tools, and we're only going to use one, but I'll at least try to give you a sense of the types of tools that are available. Then, in the practical, as I said, we'll use one tool in particular and visually explore some of these structural variants.

So what are structural variants? Structural variants are defined as genomic rearrangements that affect a larger stretch of sequence. The actual size cutoff changes as things evolve, but keep in mind it's typically bigger pieces of DNA: above 50 bp, or 100 bp, or larger than a kilobase. These are bigger events. You can have deletions, novel insertions, inversions, rearrangements, mobile element transpositions (insertions of transposable elements), duplications, and translocations. We're putting into this broad category any DNA change that affects a large portion of the chromosome. Historically, it's been known that these larger events happen and affect chromosomes; some of them are large enough that you can actually visualize them directly.
You can see some examples of that here, in cancer in particular. This is a SKY (spectral karyotyping) image of a cancer cell line. In this type of representation you would expect every normal human chromosome to be painted in a single color, and the fact that the colors here are all mixed up means there have been lots of rearrangements: whole-chromosome duplications, translocations, all of these larger events are structural variants. Historically this was well known, but the resolution of the available tools didn't allow us to identify the specific breakpoints and the specific changes that were happening. It was a lot of work to identify specific breakpoints from these high-level maps; the hope is that with next-generation sequencing we can really zoom in on those breakpoints and, in the context of cancer, find recurrent aberrations.

A bit more on the different types of structural variants, because the tools we'll be talking about target different types. You've got copy number variants (CNVs): deletions and duplications, which change the number of copies of a region in the genome. You've got copy-neutral rearrangements: inversions, which don't change the amount of DNA or the number of copies, and translocations, same thing. And then there are structural variants that are a mix of both: novel insertions, meaning foreign DNA insertions, and mobile element transpositions.

So here's a visual representation of these; we're going to be seeing lots of these types of plots.
Relative to the reference genome, a deletion obviously means losing a piece of DNA. A novel insertion and a mobile element insertion look quite similar: the genome you're sequencing has extra DNA here. Then there are tandem duplications, interspersed duplications, inversions, which simply reverse the strand of a particular piece of DNA, and translocations, shown here. I'm going through this quite quickly, but I'm sure you're familiar with it.

As I was saying, it's been known for a very long time that structural variants are important, in cancer and also in a number of genetic disorders. But it used to be that we only had a coarse-grained view, and it was actually quite hard to know specifically what the rearrangements were. Technologies such as CGH and SKY, some of the images I was showing, allowed a somewhat finer resolution of these rearrangements. Microarrays were used quite extensively, especially to detect copy number, and that's what I'll show now. Then with next-generation sequencing, which is the subject of this module, in theory we can do a much better job at detecting these structural variants.

Before we get to next-generation sequencing: a lot of work was done to detect copy number variation using microarrays, whether array CGH or SNP arrays. By hybridizing different samples you can observe that there are extra copies, because you get more intensity on specific probes for particular regions. Using either platform, you're able to say there are extra copies of DNA, or there's a loss, and with SNP arrays you can also say quite a bit about loss of heterozygosity, for instance. I won't cover that topic, but again, there are definitely alternative
technologies to look at copy number variation, quite effectively for the most part, and we've learned a great deal about the importance and frequency of CNVs from all of these array-based technologies. The nice thing with next-generation sequencing, though, is that in theory it's not just copy number variants we'll pick up: we'll be able to pick up all types of structural variants.

This is the kind of thing we were doing in the first module: these are paired-end reads mapped to the reference genome, and just from looking at whether the reads match the reference, we're able to call SNPs. We're also able to call small indels; we've done that already. Similarly, for copy number: if we don't see any reads from a particular region of the genome, we can infer that the region was deleted, or partially deleted. If we see more reads than we would expect by chance, we can infer a gain of DNA. Beyond that, if one end of a read pair maps to one chromosome and the other end maps to a different chromosome, you can say there's a translocation. So there's no question that, in theory, we can do all of this. In practice, it's the interpretation of the reads, converting them into structural variant calls, that turns out to be doable, but still challenging.

So there are different strategies.
In particular, think about what we were doing before: the reads we were looking at mapped very well to the genome. A criterion for calling a variant was that the read mapped to the genome at that position; we then looked at the quality of the bases and were able to say there's a variant. The reads we care about now don't map to the reference genome the way you would expect, so you have to do a bit more work at the mapping step, and at the interpretation step, to pick up these types of events.

So how do we use next-generation sequencing reads to call SVs? There are basically four broad strategies. One is to use read-pair information: many of the reads are sequenced as pairs, and from how the two ends of a pair map, we can detect structural variants; I'll go over that. Another strategy is to use the concept of read depth, simply the number of reads in a given region, which helps us detect copy number. A trickier one, but also a source of information, is when one read maps correctly but the other read only maps partially: this is called a split read, where a read maps up to a point and then stops matching. We can use these split-read patterns to refine exactly where the breakpoint is. And finally there's the broad category of assembly: forget about the reference genome and just look at the reads themselves. Can we detect structural variants that way? Can we assemble the genome that we actually sequenced?
I'll go over these different categories and talk a little bit about the tools. The first broad strategy for calling structural variants is to look at the information in the read pairs. Typically, when you sequence fragments, there's a distribution of fragment lengths that you expect, such that when you map your reads back to the reference genome, you know the expected distance between the two reads based on that fragment length distribution. This example is from a mate-pair library where the distance between read 1 and read 2 is distributed around 10 kb. So for any fragment of which you sequence the beginning and the end, you expect the two reads to map roughly 10 kb apart. Any pair whose two ends map in a way that's discordant, much farther apart than 10 kb or much closer, is suspicious. These are what we call discordant read pairs, and we'll use them to detect structural variants.

Another way of looking at this is the cartoon here: the genome you're sequencing is on top and the reference genome is at the bottom. We know what we fed into the machine: fragments that are roughly 10 kb here, or shorter for most libraries, typically a few hundred base pairs for Illumina paired-end reads. If, after mapping to the reference genome, the two ends are roughly 10 kb apart, we say this is fine. But any time read 1 and read 2 map too far apart, this potentially corresponds to a deletion: the reads are 10 kb apart in the sequenced genome, but when you map them on the reference they look like they're 15 kb apart, because a piece of DNA that's present in the reference has been deleted from the genome you sequenced. Similarly, if the pair maps too close together on the reference, this is also suspicious.
This would instead suggest an insertion.

So that's using only the distance between read 1 and read 2; you can also use orientation. There's an expected orientation: on the genome, the two reads of a pair should face each other. If, once you map them to the reference, they don't face each other in the expected order, that again indicates a rearrangement. You may start to see how this gets a bit more complicated, because there are all sorts of special cases, but basically any time you see read pairs that don't map to the reference the way you expect, you can convert that back into a structural variant; there are lots and lots of different cases. For an inversion, for instance: if this segment was inverted, you actually expect one abnormal read pair over this breakpoint, and another abnormal pair over the second breakpoint, so an inversion corresponds to a pair of unusual read-pair signatures, one at each end. All of these, again, are what we call discordant read pairs; this example is for an inversion.
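To make the read-pair logic concrete, here is a minimal sketch of how a single pair could be classified from its mapped positions and strands. The function name, the FR-library assumption, and the insert-size numbers (mean 450 bp, flagging beyond 3 standard deviations) are all illustrative, not taken from any particular caller:

```python
# Sketch: classify a read pair as concordant or discordant using
# mapped distance and orientation. Thresholds are illustrative.

def classify_pair(pos1, strand1, pos2, strand2,
                  mean_insert=450, sd_insert=50, n_sd=3):
    """Classify one pair mapped to the same chromosome.

    pos1/pos2: leftmost mapping positions of read 1 and read 2.
    strand1/strand2: '+' or '-'.
    Assumes an FR library: leftmost read on '+', rightmost on '-'.
    """
    left, lstrand, right, rstrand = (
        (pos1, strand1, pos2, strand2) if pos1 <= pos2
        else (pos2, strand2, pos1, strand1))
    span = right - left

    if lstrand == rstrand:
        return "inversion?"            # FF or RR: one end flipped
    if lstrand == '-' and rstrand == '+':
        return "tandem_duplication?"   # outward-facing (RF) pair
    if span > mean_insert + n_sd * sd_insert:
        return "deletion?"             # ends too far apart on the reference
    if span < mean_insert - n_sd * sd_insert:
        return "insertion?"            # ends too close together
    return "concordant"

print(classify_pair(10_000, '+', 10_430, '-'))   # concordant
print(classify_pair(10_000, '+', 15_000, '-'))   # deletion?
print(classify_pair(10_000, '+', 10_600, '+'))   # inversion?
```

Real callers of course estimate the insert-size distribution from the data itself rather than taking it as a parameter.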
The idea, just like with small variants, is that you want to aggregate information from multiple reads supporting the same event. There are a number of tools that do this in different ways: BreakDancer was one of the first popular tools, and the one we're going to use in the practical is called LUMPY. What all of these tools do is look in the mapping files for reads that didn't map the way they should, look for evidence of multiple reads supporting the same event, and then provide a list of events.

Here's an example from a project I was part of where we were doing this, just to illustrate; this was quite some time ago. It's small and hard to see here, but just from looking for these discordant reads, and clusters of them, we were able to detect places where there are deletions, tandem duplications, and all sorts of other events: multiple reads supporting the fact that there's a rearrangement. We'll go over that in the practical and see some examples of how it's done.

This was from the same study, but it shows that, compared to array CGH, which would basically just tell you where you have copy number changes, where you have gains (the plots on top), you're now also getting the information shown down here. It's kind of messy, but this is information saying: I have a thousand reads that start at this position and end at that position, which suggests that in the tumor this is really the breakpoint, and this particular piece of DNA has ended up next to it.
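The aggregation step can be sketched as a simple positional clustering of discordant pairs. The gap and support thresholds below are made up, and real callers such as BreakDancer and LUMPY use far more refined probabilistic models; this only illustrates the "many reads saying the same thing" principle:

```python
# Sketch: aggregate individual discordant pairs into candidate SV
# calls by clustering same-type pairs whose positions fall close
# together, then keeping clusters with enough supporting pairs.

def cluster_discordant(pairs, max_gap=500, min_support=3):
    """pairs: list of (position, sv_type) for discordant pairs.
    Returns [(start, end, sv_type, n_supporting_pairs)]."""
    calls, current = [], None
    for pos, svtype in sorted(pairs):
        if (current and svtype == current[2]
                and pos - current[1] <= max_gap):
            # Extend the current cluster with one more supporting pair.
            current = (current[0], pos, svtype, current[3] + 1)
        else:
            if current and current[3] >= min_support:
                calls.append(current)
            current = (pos, pos, svtype, 1)
    if current and current[3] >= min_support:
        calls.append(current)
    return calls

pairs = [(1000, "DEL"), (1100, "DEL"), (1250, "DEL"),   # one cluster
         (9000, "INV"),                                 # lone pair: noise
         (20_000, "DUP"), (20_300, "DUP"),
         (20_500, "DUP"), (20_700, "DUP")]
print(cluster_discordant(pairs))
```

The lone inversion pair is dropped for lack of support, exactly the filtering idea described above: one weird read is noise, several consistent ones are a call.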
So you get much finer information about the specific breakpoints and where the pieces go. I won't spend too much time on this; I just wanted to highlight that multiple methods do this, but it's a more challenging problem than small variant calling. If you have repetitive regions you're going to get in trouble, because here we're relying on reads that, by definition, do not map cleanly, so in repetitive regions you might actually be misplacing some of these reads; that's a challenge. If you have multiple breakpoints or multiple rearrangements within a region, that also creates very complicated patterns of abnormal read pairs, which is again quite challenging. So most methods still have quite a high rate of false positives. With small variant calling you were getting a list where most calls were good and easily filtered; most of these structural variant methods still make a lot of calls that are false positives. The strength of this type of approach is that, in principle, with these abnormal read pairs you can detect almost any class of structural variant.

So that's using read-pair information: whether both ends map where you expect them. A somewhat easier approach is the read-depth approach, which is much more similar to what's done for copy number variants using arrays. Here's an example where it jumps out quite clearly: you've got a normal sample, this is showing you the coverage, and you have a tumor sample, and clearly there are many more reads mapping into this region. Given that the DNA was all sheared the same way, the interpretation is that there are multiple copies, somewhere in the genome, of that particular piece of DNA. In most of these approaches you're basically counting reads in windows. The tricky part
comes with the normalization and the various factors that end up affecting coverage, and if you have a targeted assay, that's even more problematic than with a whole-genome approach. But basically you're looking at read density and then using that to make calls.

The strengths of these read-depth approaches are that they're fast and relatively simple; it's easy to identify gene amplifications, for instance, and it's relatively easy to validate the results with orthogonal approaches such as array-based CNV methods. The weaknesses and challenges are that it's not clear exactly what the resolution is, or how to select the appropriate bin size, so the actual boundaries of the events are not necessarily well characterized. And you cannot detect balanced events: if there's an inversion, you won't be able to tell, and if there's a duplication, you won't know where the extra copy went.

I'll use this opportunity as a plug for one of the tools we're actually working on. Ideally, you would expect coverage to be more or less uniform, with a sudden jump where the copy number changes. The problem is that in reality, even with whole-genome sequencing, coverage is wavy like this even in regions with the normal two copies, because of issues like GC content, and because in some cases the shearing itself introduces bias in coverage. This is what coverage really looks like. So the strategy we use in the approach we've been working on is to look at many samples at the same time. In green, this is showing the distribution of coverage across many samples, a hundred samples here. You can see that the coverage of all of these tends to go up and down, but in the same regions.
It's the same for everybody that a given region is easier to sequence and yields more reads. So by looking at this distribution across lots of samples, when we then look at one sample we can easily see whether it falls within that distribution or outside of it, and call copy number from that more effectively than when we're only looking at one sample horizontally. Anyway, I'm showing this partly to make the point that this is still an area of active development: there isn't just one tool. Many tools use this read-depth strategy to call copy number, but it's still very much under active development, both for read depth and for the other strategies. We won't directly cover this; with LUMPY, the tool we'll use, we'll get some evidence of deletions, but not by relying on read depth.

A summary of read-depth tools. Weaknesses: relatively low resolution; it's not clear exactly what bin size to use to be robust, but typically the bins are reasonably large, on the order of 10 kb, and you'll miss balanced rearrangements, because if there's no change in copy number you won't find anything. The strengths are that you determine the DNA copy number directly, and it works relatively well even at lower coverage; the other approaches typically require more coverage to work.

Okay, moving on to the next set of strategies. We've used the abnormal pairs; we've used depth of coverage. The next strategy uses reads that themselves don't map well to the genome.
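Before turning to split reads, the panel idea just described can be sketched in a few lines: compare one sample's binned coverage to the per-bin distribution across many reference samples, so that systematic dips and peaks (GC bias, mappability) cancel out. The data and the 3-sigma cutoff are invented for illustration:

```python
# Sketch: panel-normalized read-depth calling. A bin is called a
# gain/loss only if it deviates from what ALL samples do in that bin.

from statistics import mean, stdev

def call_depth(sample_bins, panel, z_cut=3.0):
    """sample_bins: read counts per genomic bin for one sample.
    panel: list of per-bin count lists for the reference samples.
    Returns a per-bin call: 'loss', 'neutral', or 'gain'."""
    calls = []
    for i, count in enumerate(sample_bins):
        ref = [s[i] for s in panel]          # this bin, all samples
        mu, sd = mean(ref), stdev(ref)
        z = (count - mu) / sd if sd > 0 else 0.0
        calls.append("gain" if z > z_cut
                     else "loss" if z < -z_cut
                     else "neutral")
    return calls

# A wavy but shared coverage profile: bin 2 is hard to sequence for
# everyone, so a low count there is NOT a deletion; bin 4 is one.
panel = [[100, 102, 40,  98, 101],
         [ 97,  99, 38, 103,  99],
         [103, 101, 42, 100, 102],
         [ 99, 100, 41,  99, 100]]
sample = [101, 100, 39, 100, 48]   # only bin 4 deviates from the panel
print(call_depth(sample, panel))
```

Note how bin 2, with only ~40 reads, is still called neutral: the panel shows everybody is low there, which is exactly the point of looking at the distribution across samples rather than at one sample alone.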
It's not the relative location of the two ends anymore; it's the read itself that doesn't find its place in the genome, potentially because it spans a breakpoint. This slide is not so great, so it's hard to see, but this goes back to the mapping step: at mapping time, there has to be a way to interpret reads with exactly this profile. The read needs to be long enough, but if the read is 100 bp and only the first 60 bases match perfectly, and then the alignment gets clipped and the remaining 40 bases don't map at that location, that's potential evidence of a breakpoint. Various tools, including LUMPY, which we're going to use, will scan through your BAM file and pull out those reads, to say: these reads support a breakpoint at this position.

This does require sufficient coverage, because you need reads that actually cover the breakpoints, and there's again the problem that in repetitive regions you'll have ambiguous mapping, so things can get a little tricky. The strengths are that this goes hand in hand with the read-pair methods: the same principles apply, but now you're looking for the break to happen within the read itself. And one of the big advantages is that you can basically see the breakpoint in your read: the alignment cuts somewhere within the read, so you get base-pair resolution of your breakpoints, which is quite nice.

Okay, hopefully you're still with me.
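Here is a minimal sketch of pulling a breakpoint coordinate out of a soft-clipped alignment. In SAM/BAM notation, a CIGAR of `60M40S` means 60 matched bases followed by a 40-base soft clip; the clip boundary is the base-pair-resolution breakpoint candidate. The function and thresholds are illustrative, not LUMPY's actual logic:

```python
# Sketch: detect breakpoint candidates from soft-clipped CIGARs.

import re

def clip_breakpoint(pos, cigar, min_clip=20):
    """pos: leftmost mapped reference position of the read.
    Returns the reference coordinate of a sufficiently long
    soft-clip boundary, or None if the read maps cleanly."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    # Leading soft clip: the breakpoint is at the alignment start.
    if ops and ops[0][1] == "S" and int(ops[0][0]) >= min_clip:
        return pos
    # Trailing soft clip: breakpoint where reference consumption ends.
    if ops and ops[-1][1] == "S" and int(ops[-1][0]) >= min_clip:
        # M, D, N, =, X consume reference bases; I and S do not.
        ref_len = sum(int(n) for n, op in ops if op in "MDN=X")
        return pos + ref_len
    return None

print(clip_breakpoint(5000, "60M40S"))   # clipped tail -> 5060
print(clip_breakpoint(5000, "35S65M"))   # clipped head -> 5000
print(clip_breakpoint(5000, "100M"))     # fully mapped -> None
print(clip_breakpoint(5000, "95M5S"))    # clip too short -> None
```

Real callers additionally try to remap the clipped portion elsewhere, which is what distinguishes a genuine split read from a low-quality read end.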
The last strategy to identify structural variants from NGS data is assembly. Assembly is ultimately very appealing because, in theory, you're sequencing a genome that may differ from the reference, so why map to the reference and then look for differences? In theory you could take all the reads you have, assemble the genome from scratch, and see everything directly. In practice, that's really hard: assembling all the reads into a good-quality genome is very challenging, and you tend to make more mistakes than correct calls, so assembly errors themselves become false positives. First of all, it takes a lot of computational resources: aligning reads to a genome is relatively easy compared to solving a puzzle with 500 million pieces. But there are useful variations: for reads that don't map to the genome at all, for instance, you might take all of those, assemble them, and look for structural events directly from the reads. The advantage is that if the inserted material is foreign DNA, for instance, you would be able to recover it as well. Again, this is work in progress: a number of groups are working on efficient methods that look for evidence of structural variants directly from the unaligned reads, before mapping. I'm listing a few tools here. The weakness is that it's computationally very intensive; people are trying to speed it up, but it's really hard to assemble genomes, especially something the size of the human genome, and it's quite hard to
resolve repetitive regions, which are always hard to assemble well. So it doesn't necessarily solve all the problems we had before. The strength is that if we could solve this problem well, we would have base-pair resolution of all breakpoints and could, in principle, find all classes of variation. Hopefully we'll get there one day, but with short reads, assembling a genome is quite a challenge; it's important to know that this is a computationally hard problem.

So here's a summary of the strategies that can be used to call structural variants, ordered from low to high resolution, meaning how close you get to the breakpoints, and from low to high difficulty. I talked about the depth-of-coverage methods: these are the easiest; they give you copy number, but not exact breakpoints, because of the binning strategies, and you only get copy number gains or losses without knowing where those copies sit in the genome. The paired-end and split-read approaches go together, with split reads bringing you even closer to the breakpoints. And then ultimately you have de novo assembly, which in theory gives perfect resolution but is hard, and it's not just that it's challenging: it also makes mistakes.
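Just to give a flavor of the assembly idea, here is a toy greedy assembler that merges reads by their longest suffix-prefix overlap. Real assemblers work very differently (de Bruijn graphs, error correction, heavy engineering); this is only to illustrate reconstructing sequence without a reference, and everything in it is made up for the example:

```python
# Sketch: toy greedy overlap assembly of error-free reads.

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of contigs with the best overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best = None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n and (best is None or n > best[0]):
                        best = (n, i, j)
        if best is None:
            break                      # no overlaps left: keep contigs apart
        n, i, j = best
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)
    return contigs

reads = ["ATTAGACCTG", "CCTGCCGGAA", "GCCGGAATAC"]
print(greedy_assemble(reads))   # one contig spelling out the source sequence
```

Even this toy shows why repeats are deadly: with two identical repeat copies, the greedy merge has no way to know which overlap is the true one, which is exactly the weakness discussed above.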
Having a perfect assembly using short reads is very difficult. From all of this, I don't want to leave you discouraged, but I wanted to show you what people tend to do. I think the state of the art for structural variant calling is to use many methods. These examples are from the 1000 Genomes Project and from a recent population sequencing project, and what people typically do is apply many tools that use different sources of information, from read pairs to read depth. They run many tools in parallel and then look for events supported by multiple tools or different approaches. That's a way of enriching for true calls: if you see an event independently from two different tools, that's a good indication it's real. Both projects used around ten algorithms, looked at the calls that were common, and then went on to do a lot of validation. So it works, and it's a great hypothesis generator, but there's no perfect tool that I'm aware of that gets it right all the time.

That sets the stage for what we're actually going to do, which is look at some calls. These are the types of things you would expect to see, and when I show you this, you'll probably say: this looks easy, how come no tool can just pull these out? The tools do pull them out; they just also pull out a lot of things that are not real, so there are many false positives mixed in, but there are definitely real events in there. It's not that the approach doesn't work.
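The multi-tool consensus idea can be sketched like this: keep only calls from one caller that are confirmed by a same-type call from another. The 50% reciprocal-overlap criterion is a common convention in published SV comparisons; the call sets here are made up for illustration:

```python
# Sketch: consensus of two SV call sets by 50% reciprocal overlap.

def reciprocal_overlap(a, b):
    """a, b: (start, end) intervals. Returns the smaller of the two
    overlap fractions, so both intervals must be well covered."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    if inter <= 0:
        return 0.0
    return min(inter / (a[1] - a[0]), inter / (b[1] - b[0]))

def consensus(calls_a, calls_b, min_ro=0.5):
    """Keep calls from caller A matched by a same-type call from B."""
    kept = []
    for chrom, start, end, svtype in calls_a:
        for c, s, e, t in calls_b:
            if (c, t) == (chrom, svtype) and \
               reciprocal_overlap((start, end), (s, e)) >= min_ro:
                kept.append((chrom, start, end, svtype))
                break
    return kept

caller_a = [("chr1", 1000, 5000, "DEL"),
            ("chr1", 90_000, 91_000, "DUP"),
            ("chr2", 500, 800, "INV")]
caller_b = [("chr1", 1200, 5200, "DEL"),      # matches A's first call
            ("chr2", 10_000, 12_000, "INV")]  # elsewhere: no match
print(consensus(caller_a, caller_b))
```

The trade-off is the usual one: requiring two-caller support cuts false positives at the cost of losing true events that only one strategy can see, which is why the projects mentioned above follow up with validation.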
It's just that the false-positive rates are high. Here's an example: you have reads everywhere, no reads in this region, and on top of that a lot of read pairs that don't have the expected distance, that are too far apart. That's very good evidence for a deletion. Here's another, very good evidence of a duplication: the read depth goes up, you have more reads in this region, and on top of that you've got all of these read pairs in the wrong orientation, or at the wrong distance. For an inversion, I mentioned that you don't expect a change in coverage, but you do expect abnormal read pairs at the two breakpoints, on the two sides of the inversion; that's what we see here.

In the lab, we're going to use IGV and look at how we can color reads to see these types of patterns. It's the same principle as with small variant calling: you're going to see a lot of regions with weird reads, but what we're looking for is lots of weird reads that all say the same thing. That's what the various algorithms pull out of the data: consistently weird reads. And finally, here's an example of an insertion of a transposable element.

This was true before, but it's even more true for structural variants: after you've run the pipeline and have predictions, it's quite useful to look at some examples, see what kind of read support there is for a particular event, and learn what patterns to expect for the different types of events. For deletions, the read depth signal is pretty obvious; the patterns of abnormal read pairs are slightly trickier to
interpret.

One more thing, which we won't cover in the lab: you can make beautiful figures if you call structural variants. A Circos plot is one way to represent an overview of all the structural variants in your data set: a representation where all the chromosomes are arranged in a circle, with lines showing that this piece ended up next to that one. We won't cover it, but if you call structural variants you can make nice figures; that was the message of this slide.

Okay, so with that, I think we'll try to call some structural variants in our data set.