Okay, thanks. So I'm going to give a lecture, hopefully an hour or a little less, on what's called structural variation, as a complement to what Michael just described. In some ways you can think of structural variants as really big indels: big insertions of DNA, or deletions, or inversions, or translocations. So let me just skip ahead. We're going to talk about the sequencing strategies, how we use DNA sequencing to identify these types of events in human genomes, or in any genome, for that matter. Then we're going to talk a little bit about some of the tools we use to take the signals we get from sequence alignments, aggregate them, and make predictions about rearrangements that might have occurred in chromosomes. And the practical session is going to be very much like what Michael did: we're going to walk through the same data set we were working with earlier today, but instead of finding SNPs and indels, we're going to find structural rearrangements.

So what I hope to convey is a basic sense of what structural variation is (many of you probably already have a basic sense, or even a very detailed sense, of this already), a basic understanding of how we use sequencing technologies to identify structural variants, and an appreciation of the pros and cons of the three, possibly four, different strategies for identifying these types of events, as well as what I call the signatures. When there's a structural variant, we'll see that there are very specific sequence alignment signatures that suggest a deletion, or an inversion, or a duplication has occurred, and I hope to give you a sense of how to identify those signatures. Don't try to memorize them during the lecture; you can always refer back to the slides. Just recognize that there are signatures.

From a really high level, what we're talking about is this: compared to a reference genome where we've got, say, four arbitrary segments of DNA that I've denoted A, B, C, and D, larger structural variants segregate in the population just like single nucleotide polymorphisms and insertion/deletion polymorphisms do. So in this case, let's say each of these segments is five kilobases in size. In this chromosome, segment B has been deleted. On this chromosome, a new segment, X, has been inserted, maybe a tandem duplication of a 5 kb segment, so now this chromosome is 5 kb larger than it was before the mutation. The same kind of thing happens with inversions or translocations.

I think one of the most confusing things about structural variation when people first get into it is that there's a ton of jargon. Terms are often used interchangeably when they shouldn't be, and people think there's a distinction between terms when there actually isn't. So I thought I'd start by trying to clear up some of the terminology. First and foremost, people talk about copy number variation and structural variation. Structural variation is a superset of copy number variation: copy number variants are nothing more than structural variants that change the relative copy number of a given segment of DNA in that chromosome. So say a gene is duplicated; we say that gene is copy number variable because in that genome there are two copies, where in the reference there was one. You also hear the term genomic rearrangements.
It's really just a fancier way of saying structural variant, and possibly a more precise, less vague way of describing it. One of the key notions I want you to recognize is that when you have a structural variant, there are so-called breakpoints. If you have a segment of DNA that is lost, for instance, there's a novel breakpoint, a novel DNA junction, between segment A and segment C that doesn't exist in the reference genome. So when you compare this chromosome to this chromosome, you'll see a breakpoint here, because when you align the two together there's a huge difference: segment B is missing. So here's the breakpoint. They're also called junctions, but the term we're going to use throughout this lecture is breakpoint.

And if you work at all in cancer, you've probably heard or read papers about the recently more appreciated level of complex structural variation that exists in cancer genomes. Most of you are probably aware that one of the hallmarks of a solid tumor genome is a high level of chromosomal rearrangement. It turns out that often there are complex rearrangements where, within the same locus, there are overlapping deletions, duplications, inversions, et cetera, that are very hard to explain through stepwise mutations; more likely, what happened was some really complex, nasty rearrangement, such as chromosome shattering, which you might have heard about, that led to that complex structure. We'll talk a little bit about that at the end.

So it's important to understand why we actually care about structural variation. First of all, structural variants are far more common than we believed 10 years ago. A lot of that has to do with the technology we're using: we have much greater resolution to inspect and scrutinize the structure of genomes. As a rough estimate, any two of our genomes differ by about 3,000 to 10,000 structural variants. And even though there are probably three to four million single nucleotide polymorphisms between our genomes, way more than the number of structural variants, these structural variants are often very large, so the total number of base pairs affected by these mutations or polymorphisms is much greater than for single nucleotide differences. Also, if you're interested in genome evolution and speciation, these are often important events: with a large inversion in a chromosome, it's very difficult to have homologous recombination across that region. Genome instability and aneuploidy, as I mentioned, are hallmarks of cancer. And when you get into the genetic basis of traits, I think somewhat naively we often think about the markers we use to study traits as just SNPs and indels, or even just SNPs. Structural variants are genetic variants that actually have phenotypic consequences and are in many ways just as important as the types of genetic variation we're used to thinking about when studying traits.

As I mentioned, our basic understanding of the landscape of structural variation is really driven, like most things in this field, by the technology we use to study it. We've known about large-scale structural variation for a long, long time: with early cytogenetic research and karyotyping, chromosome sorting, chromosome mapping, you could see aneuploidy.
Techniques such as array CGH, FISH, SKY, and COBRA gave us greater resolution. And then with the advent of the reference genome, we could design microarrays with probes evenly spaced across the genome to look for changes in copy number. But what we're going to focus on today is using DNA sequencing to get fine-scale, often base-pair resolution of exactly where these rearrangements have occurred. And as the sequencing technologies get higher and higher throughput and the sequencing reads get longer and longer, it turns out that it's easier and easier, thankfully, to map structural variants with fairly high confidence.

One thing I should say from the very beginning: compared to the pipeline Michael took you through, the state of the art in SNP and indel calling is far more sophisticated than the state of the art in structural variant detection. We're going to walk through a fairly simple workflow, and it's easy to run the commands. It's just that, and I'll talk about this more later, the false positive and false negative rates in structural variant calling are substantially higher, probably an order to an order and a half of magnitude higher than for SNP and indel calling. Hopefully you'll understand why that is.

So, as I mentioned, let's define what a breakpoint is. Remember that we're taking sequencing data from an experimental sample and, just like when identifying SNPs and indels, we have to align that data to the reference genome. We reveal structural variants by comparison to the gospel, the reference genome, right? So when we do that, if there's a deletion of segments D, E, F, G in our experimental genome, which we'll call the test genome, this is the breakpoint: there's now a novel junction between C and H as a result of the deletion. But there are actually two breakpoints in the reference genome. So there's often confusion about what people mean when they talk about "the breakpoint" or "the breakpoints": are they talking about the breakpoints in the experimental genome or in the reference genome? Conversely, if there were an insertion of this DNA in the test genome, there would be two breakpoints in the test genome and one in the reference genome.

This confusion is what led the people who defined the VCF format, which Michael taught you about, to introduce some terminology. VCF can also represent structural variants, and to distinguish breakpoints in the test genome from breakpoints in the reference genome, novel adjacencies are the new breakpoints in the test genome, so this would be a novel adjacency, and these are break ends. More jargon, but you can refer back to it.

So let's get a high-level sense of what these patterns look like. If you literally just sat down, drew a sequence, inverted it, and mapped the breakpoints, you'd be able to recapitulate all of this. If we have a deletion in the test genome, as shown, you have a novel adjacency between C and H because of the deletion of D, E, F, and G. If it's an inversion in the test genome, as you can see, there are two novel adjacencies in the test genome: one between C and G and one between D and H, because of the inversion of that block. Here are some more patterns. If there's a tandem duplication, there's one new breakpoint, or novel adjacency, in the test genome.
If there's an insertion of DNA from another chromosome, let's imagine this is chromosome 1 and this is chromosome 10, and segment X from chromosome 10 gets inserted into chromosome 1 in the test genome. When you try to map that breakpoint, all of this sequence is going to align to chromosome 1, this is going to align to chromosome 1, but this little bit is going to align to chromosome 10. (I'm not using a pointer because I have a terrible habit of occasionally shooting people in the eye with a laser, so I thought I'd hold off.) And if there's a reciprocal translocation between two chromosomes, very similarly, we don't have to walk through this, but basically you get these reciprocal exchanges at the breakpoints: chromosome 1 to chromosome 2, I guess in this case, and you get these reciprocal patterns.

Now, the ultimate goal for mapping chromosomal rearrangements, obviously, would be to take every sample genome you have, whether it's human or bacterial or whatever, do a complete and accurate de novo assembly, and just align your new complete genome to the reference. For mammalian genomes we just really can't do that yet, primarily because the sequence reads aren't long enough and, as you probably know, the human genome is highly repetitive. So in the interim, until we can do that, we use several different strategies. The first is using read depth; we'll go into this in more detail. We can also use a strategy called paired-end mapping, which I'll talk about a bit more, and also split-read or split-contig mapping. Let's go into each of those in a little more detail.

Here's a high-level cartoon schematic of these strategies. The first two steps are the same, and they're the same two steps as you did with Michael for identifying SNPs and indels. You take your FASTQ files, you align them to the reference genome with BWA or Novoalign or MOSAIK or whatever, and, just like when you're trying to find SNPs, you're looking for some signal that represents a difference between your test genome and the reference genome. It's just that in this case, because we're looking for structural variants, the signal is different. And unfortunately, it's a little more difficult to separate this signal from noise.

One strategy is called depth of coverage. Conceptually, let's say the reference genome has this green block, but it's been deleted in the experimental genome, so this is the breakpoint of the deletion. If that's true, when you align DNA from the experimental genome to the reference genome, you're going to get coverage upstream and downstream of the breakpoint, but you'll have a gap in coverage owing to the fact that that segment of DNA was actually deleted in your sample.

You can also use paired-end mapping. I'm not sure how familiar everyone is: does everyone know what paired-end sequencing is? Yeah? Anyone not? Okay. With paired-end sequencing, let's say this is a 500 base pair fragment of DNA; we've sequenced 100 base pairs on one end and 100 base pairs on the other end. The unsequenced portion actually straddles the breakpoint. So, let's say this is a 5 kb segment again: we know that these two ends should be roughly 500 base pairs apart, because that's the size of the library we cut out of our gel.
When we align these to the reference genome, we find that, aha, this end and this end are now 5,500 base pairs apart. It's a too-big alignment, which suggests there's actually a deletion of about 5,000 base pairs in this genome. Split-read mapping is essentially the same concept, except that we've actually sequenced through the breakpoint. In the paired-end case, the breakpoint falls in the unsequenced portion; in this case, let's say we have a nice 500 base pair contiguous DNA sequence and we've sequenced right through the breakpoint, so this half of the sequence is going to align here and this half is going to align there. And as I mentioned, the end-all, be-all would be de novo assembly. There are also hybrid strategies: maybe it's not possible to assemble the whole genome, but you can use one of these strategies to find signals of structural variation, extract the DNA surrounding that region, and do local de novo assembly to build contigs that can then be aligned to the reference genome to map these events with fairly high precision.

Okay, so let's go into a little more detail about how we use read depth, or depth of coverage, to infer that some sort of structural change has occurred. The important point here is that we can only use this method to find deletions or duplications. An inversion is a so-called balanced rearrangement: there's no change in copy number, no DNA loss or gain, the DNA has just been flipped around.

Here's an example from a data set we have in my lab, a matched tumor/normal pair. This is an IGV view, zoomed way out, so all these little gray ticks are individual sequence alignments. As you see on the top, the coverage in the normal is fairly uniform, and there's some difference here in the tumor. Can anyone guess what that is? What? There's more DNA, so it's a duplication. In the tumor genome there's been a duplication of this segment. We don't know where the new copy is; it might be tandemly duplicated, it might be duplicated on another chromosome, but essentially that chunk of DNA that's in the reference genome occurs more than two times in the tumor genome.

So it's a very simple strategy. You align all your data to the reference genome, and you can imagine we just create a sliding window across each chromosome and march along, asking: how many reads are in this chunk? How many reads are in this chunk? It's basically consistent, and then all of a sudden we see a chunk where there's a lot more DNA, then another chunk, then another. We can integrate those signals across windows along each chromosome to say: well, it looks like there's consistently more DNA in this region, this looks like a duplication. The problem is that there are many reasons why you can have fluctuation in coverage that has nothing to do with changes in the actual amount of DNA in your sample, the most significant of which is GC bias. It's less of an issue now, but three years ago, when I was working really hard on these kinds of techniques, GC bias was a big issue. So before you do this counting by sliding windows across chromosomes to identify changes in coverage, you first have to normalize that coverage by the GC content in that region.
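To make the windowing and GC normalization ideas concrete, here's a minimal sketch in Python using pysam. The file names, window size, and median-based correction are illustrative assumptions, not the exact procedure from the lecture; real tools do considerably more (mappability correction, segmentation, and so on).

```python
# A minimal sketch, assuming a coordinate-sorted, indexed BAM ("sample.bam")
# and its matching reference FASTA ("ref.fa"). Window size is illustrative.
from collections import defaultdict
import statistics
import pysam

WINDOW = 10_000  # 10 kb windows; use bigger windows for low-coverage data

bam = pysam.AlignmentFile("sample.bam", "rb")
fasta = pysam.FastaFile("ref.fa")

windows = []  # (chrom, start, read_count, gc_bin)
for chrom, length in zip(bam.references, bam.lengths):
    for start in range(0, length, WINDOW):
        end = min(start + WINDOW, length)
        count = bam.count(chrom, start, end)        # reads in this window
        seq = fasta.fetch(chrom, start, end).upper()
        gc = (seq.count("G") + seq.count("C")) / max(len(seq), 1)
        windows.append((chrom, start, count, round(gc * 100)))

# Normalize each window's count by the median count of all windows
# that share (approximately) its GC content.
counts_by_gc = defaultdict(list)
for _, _, count, gc_bin in windows:
    counts_by_gc[gc_bin].append(count)
median_by_gc = {b: statistics.median(c) for b, c in counts_by_gc.items()}
overall_median = statistics.median(c for _, _, c, _ in windows)

for chrom, start, count, gc_bin in windows:
    expected = median_by_gc[gc_bin]
    corrected = count * overall_median / expected if expected else 0.0
    # Consistent runs of corrected counts well above (or below) the
    # overall median suggest a duplication (or deletion) in that region.
```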
What do you mean by CNAs? Copy number alterations, yet another term for the same damn thing. It's just another way of saying copy number variant. It's my attempt to confuse you more than you might have already been. Did I succeed? Okay.

Here's another plot of a larger region, an entire chromosome, and this is basically showing you that when you compare normal to tumor, you can identify deletions and duplications, and not only that, you can actually start to predict what the actual copy number is. It's one thing to just say there was a deletion or a duplication; especially for duplications, it's somewhat important to understand how many more copies there are. So this is basically saying: this is our expected 2N, or diploid, coverage; maybe this is 1N, maybe this is 3N, 4N.

The strengths of this strategy are that it's conceptually very simple. If you had even the most basic scripting abilities, you could get at this problem pretty quickly. You would use something like SAMtools, which Michael introduced, or this program bedtools that we maintain, to march across windows in the genome, count up reads, and compare the counts in those windows between two states. So it's pretty easy to identify gene amplifications or deletions, but the problem is that it's fairly low resolution. Because we're creating windows across the genome, we need those windows to have, on average, a statistically relevant number of reads in them. For example, let's say we only had 5X coverage of our genome, so we got 50 million reads. If we created a window size of 10 base pairs, because we wanted to detect 10 base pair deletions or duplications, on average we've probably got 3, 4, 5 reads in those windows. If you compare a window with 4 reads to a window with 6 reads in another sample, is that really evidence of a 50% gain in coverage? So the main point is that we want our window size to be big enough that the median number of reads in a window is, say, 100 or 200, so that we have some statistical power to detect differences. And the window size needed to get that number of reads is a function of how much coverage you have. That's why this slide back here says the resolution is fairly low. But the cost is low too: you could sequence just 5X or 10X coverage of your genome, and if you make your windows 50 kb, you're going to have pretty strong statistical power to detect deletions or duplications in 50 kb windows. The downside is that your resolution is only 50 kb.

Okay, here's a high-level workflow of how you actually correct for GC bias. On the top, in blue, are the actual read counts for a 12 megabase region of chromosome 17. If we didn't know anything about GC bias in these windows, we would guess that maybe on the far right there's a big duplication, and maybe right here in the middle there's a deletion. However, if you simply plot the GC content along that same chromosome, what you see is that the coverage actually tracks fairly well with the GC content along the chromosome.
So when you correct that coverage for GC content, what you get is something much cleaner, and what you find is that this thing that looked like a deletion before is actually not a deletion; it just tracks perfectly with GC content. Whereas this duplication over here seems to be real, because it's much higher than what the fluctuation in GC content in that region would explain. So essentially you take your counts for the same windows, you compute the GC content in those windows, and you normalize each count by the distribution of counts for all the windows in the genome with that same GC content. Once you have that, you can just plot it, and you can use a number of standard deviations, or Z-scores, to predict whether there's statistical significance for that change in DNA content.

Yeah? (Question from the audience about windows with extreme GC content.) So the one consequence of this approach is that if you have a window with very, very high GC content, there are probably very few windows like it in the genome. Say it's 70% GC: there are probably only 20 windows like that, just picking a number. So the distribution of counts for that GC content is going to be very, very rough, the standard deviation is going to be very high, and it's going to be very tough to pick out a duplication or a deletion from windows like that. But in the 45 to 55% GC range, you have high power, because there are a lot of windows in the genome with roughly that GC content.

This is really just a pretty slide of a normal versus two different tumors, showing, through this basic depth-of-coverage analysis after GC correction, how crazily duplicated and deleted and rearranged some tumor genomes can be. We'll see this in more detail later. At a really high level, all you need to remember is that if you just want to find deletions and duplications, gene losses and gene gains, you can do it fairly cheaply without terribly high coverage of a genome. The consequence is that you don't have very high resolution to know exactly where the deletion occurred; you just know the rough, maybe 50 kb or 100 kb, window where it occurred.

(Audience question.) It depends on the cancer type and the severity; I mean, the karyotypes for some cancers are insane. In this case, it's real. But the problem with this view is that you don't know what genes are affected, you don't know anything about the underlying biology, and you don't really know the structure of the chromosomes. You just know there are lots of places in the reference genome where there's more DNA. And that's the whole genome: from here to here is chromosome 1, then chromosome 2, et cetera. Right, so compared to the normal, maybe the normal is 2N here, and if you actually subtracted out the normal where you see those big spikes, this would look different; these are raw counts.

So, as I said: fairly easy, fairly cheap, low resolution. Now we're going to move on to paired-end mapping, which I talked about a little bit. It's also called read-pair analysis.
But I think the consensus term in the field is paired-end mapping. The central concept is that when you're doing paired-end sequencing, you size-select the fragments you're going to sequence, so you know what a normal fragment should look like. If you sonicated or nebulized your DNA and selected a 500 base pair fragment library, and you sequence the ends, maybe 100 base pairs on either end of those fragments, there are 300 base pairs of unsequenced DNA in the middle of each fragment. So let's imagine the distribution of the actual fragments we put into our sequencer looks like this. This is a very nice library, a pretend library, but maybe the upper bound is at 600 base pairs and the lower bound at 400 base pairs. The idea is that this is what our normal alignments should look like, and the corollary is that any alignment where the two ends are way out here, or way out here, is pretty strong evidence that something weird is going on, some sort of rearrangement. But the only way paired-end mapping will see that abnormal mapping distance is if the breakpoint in your experimental genome is in the unsequenced portion of the fragment.

So the way we do this, say with 400 or 500 base pair fragments, it doesn't matter: just like you did with Michael, when you align with BWA-MEM or BWA or whatever, you align both ends. You have two FASTQ files, end one and end two, and if you had 50 million fragments, you end up with a SAM or BAM file with 50 million paired alignments to the genome. What you find for human is that typically 97, 98% of those paired-end sequences align to the reference genome as you would expect. That is, with the Illumina technology, the leftmost end is in forward orientation, the rightmost end is in reverse orientation, and the distance between the two ends is roughly what you'd expect based on the library you generated. So in many ways, with paired-end mapping you throw away about 97% of your data from the get-go, and you focus on the so-called discordant alignments.

There are alignments that are too big, say 5,500 base pairs apart like we talked about before; that suggests a deletion in the test genome. Maybe they have goofy orientations: you expect forward/reverse, but it's actually reverse/forward; that's evidence of another type of rearrangement. And maybe they have the same orientation, forward/forward or reverse/reverse. Each of these alignment patterns suggests a different type of rearrangement. When the ends are too far apart, it suggests a deletion in your test genome; we've gone over this. If a pair aligns to the reference genome in reverse/forward orientation, that suggests a tandem duplication. Forward/forward or reverse/reverse suggests an inversion. If you draw all these out, you can see why. So what paired-end mapping tools that are trying to find structural variants are essentially doing is screening across the genome for clusters of alignments that look like this.
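Here's a minimal sketch of those orientation and distance rules, again with pysam. The size cutoffs are placeholders (you'd derive them from your own fragment-size distribution, which is exactly what we work out next), and a real caller does much more bookkeeping, so treat this as an illustration of the signatures, not a caller.

```python
import pysam

MIN_SIZE, MAX_SIZE = 400, 600  # placeholder concordant fragment-size range

def classify(read):
    """Classify one aligned end of a pair by the SV type the pair
    suggests; returns None for concordant-looking pairs."""
    if (read.is_unmapped or read.mate_is_unmapped or not read.is_paired
            or read.is_secondary or read.is_supplementary):
        return None
    if read.reference_name != read.next_reference_name:
        return "interchromosomal"    # translocation or distal insertion
    if read.is_reverse == read.mate_is_reverse:
        return "inversion"           # same-strand pair: +/+ or -/-
    # Opposite strands: is the leftmost end on the forward strand (FR)?
    if read.reference_start <= read.next_reference_start:
        left_is_forward = not read.is_reverse
    else:
        left_is_forward = not read.mate_is_reverse
    if not left_is_forward:
        return "tandem_duplication"  # reverse/forward (RF) pair
    span = abs(read.template_length)
    if span > MAX_SIZE:
        return "deletion"            # FR pair, but spans too far
    if span and span < MIN_SIZE:
        return "insertion"           # FR pair, spans too little
    return None                      # concordant

bam = pysam.AlignmentFile("sample.bam", "rb")
# Note: each fragment shows up twice (once per end); a real tool would
# deduplicate by query name before counting support.
discordant = [r for r in bam.fetch("chr1") if classify(r)]
```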
One caveat: you can have chimeric molecules from your library prep that lead to weird one-off alignments. So if you see just one discordant pair, you might not believe there's a structural variant. But if you start seeing it over and over and over again, just like in SNP discovery, where you want to see a variant in more than one read to believe it's not just sequencing error, when you start to see clusters of these alignments, we believe it's a real event. How many discordant alignments you require really depends on what false discovery rate you're aiming for, how much coverage you have, et cetera.

When I was doing my postdoc, there was a wild competition to make the latest and greatest paired-end mapper, just like there are 700 short-read aligners out there. There aren't quite that many structural variant callers, or paired-end mapping callers, but there are a lot. The one I wrote as a postdoc is called Hydra; that's what we're going to use today. There's a really nice one called DELLY from Jan Korbel's lab that's been used pretty extensively in the 1000 Genomes Project. Ben Raphael's lab at Brown has a really nice tool called GASVPro. And one that we'll talk about in a little bit is called LUMPY, from my lab and Ira Hall's lab. They're all fairly easy to run and give basically the same type of results, though some excel at particular types of structural variants more than others. The main reason I recommend these is that they're fairly easy to run. Some of the more obscure tools are difficult to install, kind of painful to understand the output of, and often fail.

So we've talked about the different paired-end mapping signatures; you can skip this, since the preceding slides basically duplicate it. The one I want to focus on is reciprocal translocations. I'm not sure how many people here are studying cancer, but you probably know that a lot of leukemias are characterized by a canonical reciprocal translocation, for instance the BCR-ABL translocation. So whereas in the reference genome, here's one chromosome and here's another, let's say there's a reciprocal exchange between these two chromosomes in the test genome. What we're going to get are two different breakpoints, one on each chromosome. And for a given paired-end sequence spanning one of them, this end is going to align to that chromosome in the reference genome, and this end is going to align to this chromosome. The way you identify a reciprocal translocation is by looking for a reciprocal pattern of events overlapping the same regions of two chromosomes. That leads you to believe there was a reciprocal exchange of DNA.

Split-read mapping is the third strategy I'll talk about, and it's very similar in concept to paired-end mapping. I think the easiest way to think about it is to take the deletion example again, where segment B in the reference genome has been deleted in your sample genome, so there's a novel junction between segments A and C. If you had a paired-end sequence that spanned that junction, when you align it to the reference genome, end one is going to align to segment A, end two is going to align to segment C, and there's going to be a much greater than expected distance between the two ends, right? And that's because the breakpoint was in the unsequenced portion of the paired-end fragment.
But what if the breakpoint was actually in the middle of one sequenced end? Let's say we had a 500 base pair read, or even a 100 base pair read, for that matter. That's what's depicted here: the read actually spans the breakpoint, so half of it is going to align to this part of the reference genome and half to this part, with, to keep with our example, 5,000 base pairs in between. And just like paired-end mapping, there are very specific orientations that lead you to believe this is a deletion: in this case, we would require that this alignment and this alignment are on the same strand. If it's a real deletion, you shouldn't have mixed strands. For a duplication, it's basically the same thing: the two halves align disjointly, but the orientations are different.

So really, the main difference between split-read mapping and paired-end mapping is that the resolution of the breakpoints we identify is one base pair; we can nail the breakpoints down to the exact two base pairs that form the junction. Whereas in paired-end mapping, because of the variability in our fragment sizes, even when we integrate multiple discordant alignments, we've often only narrowed the breakpoint down to something on the order of 200 to 500 base pairs. So if you combine these two strategies, you have much greater power not only to detect structural variants, but to nail down exactly where the breakpoints are. That's really important for trying to understand the functional consequences of the events.

Okay, so now let's talk a little bit about how many structural variants there are in human genomes and how that compares to other types of genetic variation, and then we're going to look at some real-world examples with real data. We learned with Michael how to identify SNPs and indels in human genomes. The big difference here is that structural variants are bigger events that vary not only in state but also in size. In contrast to SNPs, which number about 3 million per genome, we're talking about roughly 3,000 structural variants per genome, and maybe 300,000 indels. As I mentioned, it's fairly straightforward now, thanks to great tools like GATK and SAMtools and FreeBayes and all the tools Michael talked about, to find SNPs. It's still a work in progress to find indels, and the structural variants we'll find with the tools we use today are the hardest problem. The main reasons are that these variants vary not only in state (some people have the rearrangement, some don't) but also in size, and they tend to occur in more repetitive parts of the genome, so there's an intrinsic difficulty in mapping these events.

So here's a real event, a deletion, in an IGV screenshot. The gray bars are concordant alignments, and IGV marks alignments that are discordant. You define discordancy by telling IGV what you think is too big an alignment distance, and that requires some knowledge of the fragment size distribution of the library you sequenced. That's what we're going to work on in the practical session: how we figure out what those numbers should be.
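As a preview of that exercise, here's a minimal sketch of estimating the fragment-size distribution directly from a BAM with pysam. The file name, sample size, and the 10-MAD cutoff are illustrative assumptions; the point is just that the median and the median absolute deviation are robust to the discordant tail, unlike the mean and standard deviation.

```python
import statistics
import pysam

bam = pysam.AlignmentFile("sample.bam", "rb")
sizes = []
for read in bam.head(1_000_000):   # sample the first million records
    if (read.is_proper_pair and read.is_read1
            and not read.is_secondary and not read.is_supplementary
            and 0 < abs(read.template_length) < 10_000):
        sizes.append(abs(read.template_length))

median = statistics.median(sizes)
mad = statistics.median(abs(s - median) for s in sizes)  # robust spread

# One common convention: call pairs beyond ~10 MADs discordant.
lo = max(0, median - 10 * mad)
hi = median + 10 * mad
print(f"median={median}, MAD={mad}, concordant range ~ [{lo}, {hi}]")
```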
Armed with those numbers, let's say we told IGV to mark anything red that's greater than a thousand base pairs. For all these red alignments, their buddy, the other end from the same fragment, is about 1,500 base pairs away. So not only is this a deletion, but what kind of deletion do you think it is? This is a diploid genome. Right: it's a homozygous deletion, because there's no DNA in between. If it were a hemizygous deletion, we'd see this same pattern, but we should also see, let's call this the average coverage of the flanking region, coverage up here, reflecting one intact chromosome without the deletion and one with it. But clearly there's no coverage, so this is a homozygous deletion. As I said, a human genome typically has a few thousand deletions; the most common type of structural variant is a deletion. There are also a few hundred duplications.

Here's a case where, again, there's a duplication, but IGV is marking these alignments as discordant not because of the alignment distance, but because of the orientation. With an Illumina paired-end library we expect the leftmost end to be forward and the rightmost end to be reverse, but here it's the opposite: we've got reverse/forward. If you skip back a few slides, you'll see that the reverse/forward signature suggests a tandem duplication. And in fact, not only do we see the paired-end mapping signature of a duplication, we see the excess coverage as well.

Yeah. So what's missing from this view (at the time I made these slides, the technology I'm about to describe didn't exist) is being able to connect end one with its buddy over here, end two. Let's just pretend these two go together, and these two, and these two. It's not multiple mapping; I'm just not able to draw the connections between the ends of each paired-end fragment for you.

The what number? Ah, so let's take a simple example. Can you see from this angle? Yeah? Okay. Take the reference genome and say segment B was duplicated. When you sequence this genome, there's just twice as much DNA for that segment. So when you align it to the reference genome, segment A is going to look like that, segment C is going to look like that, but segment B now has twice the coverage.

Okay, so an inversion is going to give a different pattern, and that's actually the next slide, so thank you. An inversion is going to look like this. Let's skip back for a second to one of the earlier slides, if I can figure out how to get my mouse over here. Yes. So this is the paired-end signature for an inversion. The reference genome looks like A, B, C, D, but the experimental genome is A, C, B, D: B and C have been inverted, right? So there's a new breakpoint here and a new breakpoint here. When your paired-end sequences span that, think about what's going to happen. This end is going to be on the forward strand, but when this end aligns to the reference genome, it's going to align like that: also on the forward strand, because that piece of the reference genome, let's say these are 5 kb blocks, is actually 5 kb downstream and, because it's inverted, its alignment is also going to be on the forward strand. And for the other breakpoint, it's the exact opposite.
This end is in reverse orientation, as we'd expect, but when we align the other end, it's also going to be in reverse orientation. So the signature for an inversion is forward/forward on one side of the breakpoint and reverse/reverse on the other, and the clusters overlap like this. Let me show you the real data; that's exactly what you see here. The issue, again, is that you can't connect the dots. Here's your cluster of forward alignments; their buddies are these forward ones here (you might be able to see it better on the printout). And then there are some reverse alignments here, and their buddies are right here. So you've got forward/reverse flanking, and then forward/forward and reverse/reverse, suggesting that the inversion is right there. Does that make sense? If you sit down and actually draw it out with a sequence, I think it'll become clear. There's a lot of mental DNA gymnastics you have to do to make sense of these patterns.

The other thing that's fairly common in the human genome: you're probably all familiar with retrotransposition. LINE elements and SINE elements were really, really active earlier in the history of the human genome, so a lot of the repeat content in the reference genome was generated by retrotransposition. And they are actually segregating, polymorphic variants in the genome. For instance, here's a case where the reference genome has a LINE element, and when we sequence this individual, there's clearly a loss of coverage there. So the idea is that this individual's genome doesn't have the LINE element in it. The reason I'm making this distinction is: you look at this signature, and what would you think it is? It looks like a deletion, in the way we've been talking about it, right? But if you think about the mechanism, it's actually not a deletion. Retrotransposition is a copy-paste event. So really the reference genome has an insertion here that our experimental genome doesn't have.

Now, I bet some people are scratching their heads about why you have all these other alignments there. Michael, you talked about mapping quality, right? One of the things that goes into mapping quality is not only the quality of the local sequence alignment, how many mismatches there were, but also how many alternate locations in the genome were almost as good. LINE elements are everywhere in the genome, and they're often highly, highly identical. So for all these sequences here, the aligner chose to place them here, but the color indicates there's very high uncertainty, reflected in the mapping quality, about whether this is the right place for the alignment. So there is coverage here, but it's just coverage that's been randomly distributed by the aligner among all the LINE elements in the genome that look just like this one. Does that make sense?

Yeah, with that coverage there? Yes. So again, the call on top connects the dots. That's Hydra, the tool we're going to use, integrating the signal from all these alignments; what you can't see is that this end connects to the other end, in the same way the red call on top does. (Question from the audience: why did the aligner place anything in that space at all, if there was nothing there?)
So there are a few ways to handle sequences that have multiple equally good places in the genome to align. One is to ignore them completely and not choose any of the spots: just say, ah, this is repetitive, I don't want to touch it, and not even report it in the output. In that case, we'd see nothing here. Another is to report an alignment to every single place in the genome the sequence could align, given the alignment parameters you've used. The third option is to choose one of those locations at random, which is what's been done here. Say there are 100 equally good places for these sequences to align, to LINE elements like this one; the aligner chooses one randomly, so the 99 other places in the genome look just like this, with a random scattering of sequence alignments. And then the mapping quality is used to reflect: hey, by the way, I aligned the sequence here, but there were roughly 99 other places in the reference genome it could have gone, so don't really trust this placement.

Sure, but the signal for that is so minimal because there are so many copies, right? It's maybe a difference of 99 out of 100 versus 100 out of 100. Yes, but it's one out of 100 distributed among all 100 copies.

So how do we know it's an insertion in the reference rather than a deletion? We're inferring this, right? The UCSC Genome Browser, for instance, tells us there's a LINE element in the reference genome here. We know LINE elements copy and move about the genome through retrotransposition, through a copy-and-paste mechanism: when they move, they don't excise themselves from the genome, they copy themselves. So we infer that because this looks like a deletion in our experimental genome, and there's a LINE element right here, it probably wasn't actually a deletion in our experimental genome; it's really that the reference genome has an insertion that we don't. The main reason for thinking about this is that you might interpret the functional consequences of this thing that looks like a deletion very differently than if it overlapped five exons of a gene and were a real deletion.

(Question from the audience about comparing multiple strains that share sequence absent from the reference.) Right, so one thing you could do is align all of them to the reference genome, call rearrangements in each strain individually, and then compare the call sets to find events that are private to one strain versus another. Does that answer your question? Maybe we can talk afterwards and come up with a solution.

The last two cases are a little more esoteric, so we'll go through them fairly quickly. Just as you can have retrotransposition events that are in the reference genome and not in the sample, you can also have retrotransposition events that are in the sample and not in the reference genome. This is something Michael worked on as a Ph.D. student, and the lab we both came from works on it pretty actively. The signal looks like this: let's say we're on chromosome 1, and there's all this discordant alignment here, all over the place, in different colors, where the different colors reflect alignments to different chromosomes.
Each chromosome gets a different color, so clearly something crazy is going on here. And what we find is that if we look at the alignments on those other chromosomes, they consistently overlap LINE annotations, every one of them; it's just LINE elements in different parts of the genome. So we can use that to infer that there's actually an insertion of a LINE element in the sample genome here, from the fact that these aberrant alignments consistently hit the same class of LINE element. This is one of the challenging parts of structural variation: it's a repetitive sequence that was inserted, so when you align these fragments to the reference genome, one end aligns in this spot, and the other end aligns to the 5,000 other places in the genome where a LINE element just like the inserted one is found.

There are a couple of different strategies for detecting these. I came up with one that isn't great; that's what's reflected here. The lab where Michael and I did our PhDs came up with a better strategy, which is to align these sequences to a known list of repeat consensus sequences; you can identify the same thing far more quickly and far more accurately. The main point is that we've got an insertion: a retrotransposition event that occurred in the sample genome and isn't found in the reference genome.

The last case is a little more complicated: retroposed genes. The same machinery that allows retrotransposons to insert can take a processed messenger RNA and insert it into the experimental genome as a processed pseudogene. So you get these clusters of things that look like deletions, but what you notice is that they all coincide with different exon combinations in this gene annotation. That suggests you actually had a messenger RNA that was reverse-transcribed into DNA and inserted into your test genome. So when you align sequences from that inserted copy to the reference genome, you get this pattern of all the different exon-exon junctions aligning to the reference genome: the gene was inserted through the retrotransposition machinery.

Yeah, it's pretty rare. It's just a fun one to look at; it took us a while to figure out what the heck was going on with these. And this varies by genome: this is a mouse example, and retrotransposition is still much more active in the mouse genome; it's probably a very rare event in the human genome. And yeah, exactly: we just know there was an insertion somewhere in the genome, but we don't know where. The reads align to the gene whose messenger RNA was reverse-complemented and put into the test genome; we have no idea where it landed. We'd have to use, like you said, mate-pair sequencing or something similar to figure that out.

Okay. So the last bit I'm going to focus on is some of the dirty secrets of structural variation discovery, plus a little bit of the cancer work we've done. At the end I've put in some slides about this LUMPY structural variant tool that we've developed; I'm not going to go through that, but feel free to look through the slides, which explain what it does. In the interest of time, I'll focus on the dirty secrets of structural variant discovery. We've already touched on some of these problems.
One is that, unlike SNP and indel calling, structural variant discovery suffers from a pretty high false positive rate, and there are a lot of reasons for this. Our signal for structural variants is discordant alignments, and the problem is that there are multiple reasons you can have discordant alignments beyond the real biological signal we're looking for.

First, the human genome is very repetitive, and despite hard, hard work to make the reference assembly perfect, there are still parts of the genome that are poorly assembled. So a lot of what you end up finding as structural variants in these studies is actually not structural variation; it's misassembly in the reference genome. The way you identify those cases is that if you see the same structural variant in basically every sample you sequence, it tells you this is actually not a polymorphism; it's the reference genome that's screwed up. Fortunately, the reference genome is very high quality, so this doesn't happen all over the place, but certainly centromeres, telomeres, and some of the highly segmentally duplicated parts of the human genome are not well assembled yet.

Second, chimeras. When you're doing library prep, with adapter ligation and all these ligation steps, you end up, at a certain rate, ligating random DNA fragments together. So when you sequence them, you've got two random pieces of DNA joined together, and they align to crazy places in the reference genome. And if you have a high duplication rate in your library, you end up duplicating these chimeras so that they look like real structural events. I said we require several discordant alignments to make us believe an event is real; well, if you have a high chimerism rate and a high duplication rate, that is, a library that isn't very complex, you're going to artifactually get a lot of fake-looking structural variants. That's where knowing ahead of time what your DNA library looks like, what the duplication rate is, and what the chimerism rate is really helps you choose appropriate thresholds when calling structural variants. The take-home message is that in pretty much every structural variation paper, deep in the methods or the supplemental methods, there's a list of filters that were applied to exclude artifacts. We all do it. It's simply how you get a list of 40,000 structural variant predictions in a human genome down to the 3,000 to 10,000 that are probably real. The vast majority of things excluded are events in misassembled parts of the genome.

Unfortunately, there's also a high false negative rate, and that false negative rate is primarily a function of the coverage you have, right? To get around some of the false positive issues, you might require, say, four or five discordant alignments before you believe something is real. Well, then you're inherently going to miss all the events that are real but had only three or two or one discordant alignments. The higher your coverage, the lower your false negative rate, but there's a cost associated with that, and if you're doing this on a whole-genome scale, you're balancing how much money you want to spend on coverage against your false positive and false negative rates.
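To make that filtering concrete, here's a minimal sketch of the two most common filters: a minimum number of supporting discordant pairs, and exclusion of calls in known-bad regions. The call and blacklist representations are hypothetical; real pipelines typically do this with bedtools against published blacklist intervals.

```python
def overlaps(chrom, start, end, regions):
    """True if the interval intersects any (chrom, start, end) region."""
    return any(c == chrom and s < end and start < e for c, s, e in regions)

def filter_calls(calls, blacklist, min_support=4):
    """calls: iterable of (chrom, start, end, n_supporting_pairs)."""
    kept = []
    for chrom, start, end, support in calls:
        if support < min_support:
            continue  # too little evidence: likely chimera/duplicate noise
        if overlaps(chrom, start, end, blacklist):
            continue  # likely a reference misassembly artifact
        kept.append((chrom, start, end, support))
    return kept

# Example: a call with 6 supporting pairs survives; one with 2 pairs,
# or one inside a blacklisted (e.g., centromeric) interval, is dropped.
blacklist = [("chr1", 121_500_000, 125_000_000)]
calls = [("chr1", 5_000, 10_000, 6), ("chr1", 50_000, 52_000, 2),
         ("chr1", 122_000_000, 122_050_000, 9)]
print(filter_calls(calls, blacklist))  # -> [('chr1', 5000, 10000, 6)]
```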
The other thing is this irony: the misassembled and highly repetitive parts of the genome are where a lot of the false positives are, but those same highly repetitive regions are where a lot of the action for structural variation actually is. So you're throwing calls out to avoid misassembly artifacts, but you're probably also throwing away things that are real; it's just really hard to tell for sure that they're real. So there are false negatives built into the misassemblies in the reference genome. Same thing with the filtering I just talked about for false positives: you always, inevitably, throw out some true positives. It's the same issue with the strategies Michael was talking about, VQSR and so on: you're trying to achieve a low false positive rate, but inevitably you throw out true positives as well.

In terms of false negatives and false positives, this is a real concern for studies like tumor/normal comparisons, where you're trying to identify somatic rearrangements that occur only in the tumor. The way you do that is by comparing the evidence in the tumor versus the evidence in the normal. But the early studies went for deeper coverage in the tumor, to find lots of things, and saved money by getting less coverage in the normal. So they falsely concluded there were a lot of somatic events in the tumor, simply because they didn't have enough data to recognize that the events were actually in the normal as well. It's the same issue for de novo mutation studies in trios: if you're looking for spontaneous mutations, you need a lot of coverage in the parents so you can really believe that something in the kid that looks spontaneous really is spontaneous. And that goes for structural variants, SNPs, indels, whatever.

We already touched on this: structural variants involving repetitive sequence are really, really hard to detect, simply because the reads have lots of equally good alignments all over the genome, for instance these retrotransposition events. There are also segmental duplications, also known as low copy repeats, where non-allelic homologous recombination leads to big duplications or big deletions, but our sequencing technologies don't generate long enough reads or big enough fragments to detect these things. This is the example I was showing a minute ago, where you have an insertion of a repetitive element in the test genome: one end aligns in the local region, and the other end, the one that reflects the inserted element, has multiple places it can go when you align it to the reference genome.

As I just said, segmentally duplicated regions drive structural variation through non-allelic homologous recombination. When you have these two big, highly identical duplicated elements, recombination can slip, and you can get recombination between one element and its non-allelic, highly identical neighbor, leading to either a deletion or a duplication. These elements tend to be at least 10 kb in size, and we just don't have insert sizes from paired-end sequencing big enough to span these rearrangements, because really you need the novel breakpoint to be spanned by the whole fragment.
The last secret, and this isn't really a secret, is complex structural variants. Especially in tumor genomes, though there's mounting evidence that this occurs in the germline as well, they generate ridiculously confusing and complex breakpoint patterns. We just published a paper on this a couple of months ago looking at 64 cancer genomes; the prevalence is much higher than we thought, and we're working on methods to take these crazy rearrangement patterns and answer what you really want to know: what does the derivative chromosome look like, what genes are affected, what's the functional consequence of these rearrangements?

So here's the simplest case, which is just an inversion, right? There are two breakpoints, but it's actually just one event: a single rearrangement with two breakpoints. But here's a case where multiple things are going on. On the top... I'm running out of pointer. Is there a pointer? Okay, close your eyes for a second while I figure this out. Okay, so this is the experimental chromosome, and this is the reference genome. What we did in this study was take all these discordant alignments, all this discordant paired-end signal, put the data into a de novo assembler, and get out an assembled contig that we could align back to the reference genome. The whole point was to identify the rearrangement breakpoints at one base pair resolution. So here's a case where there's a deletion of this segment relative to the reference genome, there's also an inversion, and then there's another deleted segment. This was in the mouse genome, and this is a visualization of the same thing using Savant, which you're going to learn about tomorrow and which gives, I think, better clues about what's actually going on.

This led us to study complex rearrangements in other situations. We had some early evidence of complex rearrangements in cancer, and there's a long literature on this in cancer, but next-gen sequencing allowed us to map them at much higher resolution. I don't have time to go into the real literature behind this, but there's a newly proposed mechanism called chromothripsis, which some of you might have heard of; it means chromosome shattering. The basic notion is that you see these complex rearrangements, so these are all breakpoints, and on this one segment of one chromosome arm (I don't remember which chromosome this is), 85% of the breakpoints in this tumor occurred right there. So yes, there's a lot of rearrangement in this tumor, but it's focal, and the pattern suggests this was one mutational event. That's not how we usually think of cancer progression, which is selection in a cell population: a new mutation occurs, confers some selective advantage, then more mutation, and more mutation.
All these mutations occurred in the same event. The idea is that the chromosome shattered, and for the cell to survive, the chromosome had to be put back together through non-homologous end joining, so the fragments of DNA were stitched back together essentially at random. When you align that chromosome back to the reference genome, you get these ridiculous patterns, and I have no idea how to make sense of this; this is just drawing the data, right? We can use tools like Circos to draw these rearrangements. Here's basically the same event: a focal rearrangement on this end of chromosome one, where these arcs represent rearrangements, and on this axis what you see is changes in copy number, so these are losses and these are gains. Again, we see there's a lot of cool-looking stuff, crazy and really hard to figure out, but it doesn't really yield much insight into what the structure of the actual chromosome looked like. So if you're interested in looking at this a little more, we profiled 64 cancer genomes from TCGA, looking at complex rearrangements, and made some basic conclusions. I'll come to this in a second, but here are six tumor genomes: glioblastomas on the bottom, squamous cell lung cancer, here's another glioblastoma, and there's another lung cancer. Basically what you see is that these complex rearrangements are fairly common; we see them in a much higher fraction of tumor genomes than we expected. The main result we found is that glioblastomas and squamous cell lung cancers are hugely enriched for these chromothripsis-like events, with not a lot of insight as to why at this point. There was also some debate in the field about what mechanisms lead to this. Some people argue that stalled replication forks would produce these rearrangement patterns, but the lack of homology at the breakpoints that we see suggests that it's actually non-homologous end joining, which implies some mechanism for shattering the chromosome and then using end joining to stitch it back together. What we also saw was that these chromothripsis events are at a higher allele frequency within each tumor, suggesting either, A, that they confer some sort of selective advantage, because they're at a higher frequency, or, B, that they occurred early in tumorigenesis. So there's a huge amount of work left to figure out what's really driving these, what mechanism causes chromothripsis, why it differs among cancer types, et cetera, but that's future research. Okay, so in the last couple of minutes, I'm just going to talk about the tools that we use. In the lab we're actually going to use them, but I just want to give you a really high-level sense of what tools are out there. As I mentioned, there's a ton, so I added a list on the Wiki, at the bottom of the front page, under resources, tips and tricks, or something like that.
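As a toy illustration of that "focal" pattern, here is a sketch that slides a window along each chromosome and reports the largest fraction of a genome's breakpoints that fall inside it, echoing the 85%-on-one-arm observation. The window size and the input list are invented for the example; real chromothripsis calling also weighs things like copy-number oscillation, so treat this as a cartoon of one criterion only.

```python
# Sketch: how concentrated are the breakpoints? Slide a window over each
# chromosome and find the largest fraction of all breakpoints inside it.
from collections import defaultdict

def focal_fraction(breakpoints, window=50_000_000):
    """breakpoints: list of (chrom, pos). Returns (fraction, chrom,
    window_start) for the densest window found."""
    total = len(breakpoints)
    by_chrom = defaultdict(list)
    for chrom, pos in breakpoints:
        by_chrom[chrom].append(pos)
    best = (0.0, None, None)
    for chrom, positions in by_chrom.items():
        positions.sort()
        left = 0
        for right, pos in enumerate(positions):
            while pos - positions[left] > window:   # shrink window from left
                left += 1
            frac = (right - left + 1) / total
            if frac > best[0]:
                best = (frac, chrom, positions[left])
    return best

# Invented example: 17 of 20 breakpoints piled onto one arm of chr9.
bps = [("chr9", 20_000_000 + i * 1_000_000) for i in range(17)]
bps += [("chr3", 5_000_000), ("chr12", 80_000_000), ("chr17", 40_000_000)]
frac, chrom, start = focal_fraction(bps)
print(f"{frac:.0%} of breakpoints within 50 Mb starting at {chrom}:{start}")
```

A genome whose breakpoints are spread evenly would score low here; a chromothripsis-like genome scores high on exactly one chromosome.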
One of the things that we'll add in there today is a list of some of the aligners and SNP callers and structural variant tools. There are also two really nice web resources, SEQanswers and Biostars; I think we'll cover those a little more tomorrow, but here's a URL to a Wiki page on SEQanswers that maintains a list of all the tools out there at your disposal for all these different tasks. I just queried that Wiki site, and this is probably a third of the structural variant tools that are out there right now. So there's a huge list; refer back to the slides from a moment ago for the tools that we recommend, that we have experience with and trust. The one that we're going to use today is called Hydra. This is something I wrote when I was a postdoc, and the basic idea is that you take these discordant alignments, these signatures that we've learned about, and you look for consistent evidence at a given locus for the same pattern. It's one thing to see lots of discordant alignments in a region, but if they don't agree with each other in terms of their mapping distance and the orientation of their ends, that suggests it's probably not real. Whereas if you see lots of discordant alignments that all suggest the same distance and are all in the same orientation, that separates the signal from the noise: this is a real event. So all Hydra does, and it took us a long time to figure out how to do this quickly, but what it does now sounds so simple: we sweep across the genome, look for regions that have these clusters of discordant alignments, cluster them once we find them, and report the intervals that the predicted breakpoint falls in. And that's what we're going to do today, with the NA12878 genome: we're going to find all the structural variants on chromosome one of this individual. For the cancer study I was talking about, there's also a new version of Hydra called Hydra-Multi that allows you to take hundreds and hundreds of genomes at once and use mutual evidence across those genomes to have better power to detect really rare events. In tumors, for instance, maybe there are only two supporting reads in a given tumor, so if you analyzed that tumor on its own, the event wouldn't exceed the threshold you would require to call it. But if you see evidence in lots of other samples, that makes you believe the event is real. It's the same concept people use for multi-sample SNP calling, just applied to structural variant calling. And then there's this other tool that I'll let you look at on your own, something we're probably submitting this month, called Lumpy, which I'll just highlight really quickly. We talked about depth of coverage for copy number, paired-end mapping, and split-read mapping: the three primary signals for structural variation. The reality is that until the end of last year, every tool out there only looked at one of those signals. And we talked about low sensitivity, a high false negative rate; one of the main reasons for that is that most tools just didn't use all the signal available to them, and the reason they didn't is that there are some technical challenges to doing so.
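Here is a bare-bones sketch of the clustering idea just described: a single sweep that groups discordant pairs which agree in orientation and coordinates, and reports clusters with enough support. The Pair fields, the slop, and min_support are assumptions for illustration; Hydra's actual algorithm and data structures are more involved than this.

```python
# Sketch: cluster discordant pairs that agree in orientation and imply a
# consistent breakpoint, in one sweep over position-sorted pairs.
from collections import namedtuple

Pair = namedtuple("Pair", "chrom start end orient")  # one discordant pair

def cluster_discordants(pairs, slop=500, min_support=4):
    calls = []
    open_clusters = {}  # orientation -> currently open cluster of pairs
    for p in sorted(pairs, key=lambda p: (p.chrom, p.start)):
        c = open_clusters.get(p.orient)
        if (c and c[-1].chrom == p.chrom
                and p.start - c[-1].start <= slop     # same neighborhood...
                and abs(p.end - c[-1].end) <= slop):  # ...and same implied distance
            c.append(p)
        else:
            if c and len(c) >= min_support:
                calls.append(c)                       # close a supported cluster
            open_clusters[p.orient] = [p]             # start a new one
    for c in open_clusters.values():                  # flush at end of sweep
        if len(c) >= min_support:
            calls.append(c)
    return calls

# Toy usage: six pairs that all imply the same deletion breakpoint.
pairs = [Pair("chr1", 100_000 + i * 50, 160_000 + i * 50, "+-") for i in range(6)]
for c in cluster_discordants(pairs):
    print(f"predicted breakpoint: {c[0].chrom}:{c[0].start}-{c[-1].end}",
          f"orient={c[0].orient} support={len(c)}")
```

Note how the orientation check and the distance check do the work: scattered discordant alignments that disagree never accumulate enough support to be called.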
That's what the framework we developed solves: we can integrate split-read mapping, paired-end mapping, any signal, into the same probabilistic framework, and the consequence is that we have much higher sensitivity. We find more of the real events without increasing our false positive rate, so we're pretty excited about that tool. One of the ongoing projects in our lab is studying complex rearrangements in cancer, so we need good tools to find these events. Okay, so with that, I will end. The last slides have some more details about that tool if you're interested.
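As one last aside, here is a toy sketch of the signal-integration idea just mentioned: turn each evidence type into a weighted breakpoint interval, merge overlapping intervals across signal types, and sum the weights into a single support score. The weights, intervals, and min_score are invented for the example; Lumpy's real probabilistic model is considerably more sophisticated than this additive merge.

```python
# Sketch: merge breakpoint evidence from different signal types into one
# score. Each piece of evidence is (chrom, start, end, weight).
def merge_evidence(evidence, min_score=4.0):
    calls = []
    current = None  # running merged interval: (chrom, start, end, score)
    for chrom, start, end, w in sorted(evidence):
        if current and current[0] == chrom and start <= current[2]:
            # Overlaps the open interval: extend it and add the weight.
            current = (chrom, current[1], max(current[2], end), current[3] + w)
        else:
            if current and current[3] >= min_score:
                calls.append(current)
            current = (chrom, start, end, w)
    if current and current[3] >= min_score:
        calls.append(current)
    return calls

# Paired-end intervals are wide but numerous; split reads narrow but precise.
ev = [("chr1", 99_800, 100_400, 1.0) for _ in range(3)]   # discordant pairs
ev += [("chr1", 100_150, 100_160, 2.0)]                   # one split read
print(merge_evidence(ev))  # -> [('chr1', 99800, 100400, 5.0)]
```

The payoff is exactly the one described above: neither signal alone clears the threshold here, but combined they do, which is where the extra sensitivity comes from.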