Good morning. In this lecture, what we're trying to do is talk about larger-scale variation events in genomes, such as the human genome, and to illustrate the methods that are used to detect them. Michael did a great job yesterday showing how SNP detection works and how important it is to get relatively high-quality alignments in order to be really good at finding SNPs. Today I'm going to try to do the opposite and say, well, alignments are not nearly as important when you're looking for larger structural variation events — although you still need alignments, their quality can be a little bit sketchier.

To start with, just one slide of overview of things which you probably already know, either from before or at worst from yesterday. There is a whole bunch of genomic variation which underlies the differences between all of us. This variation comes in many different sizes, from SNPs, which occur at roughly one in a thousand positions and can be detected by comparing a read to a reference genome and finding positions that look different — ooh, we got a new laser pointer today — to larger-scale structural genomic alterations. These can be things like insertions, deletions, inversions, translocations, and changes in copy number of a sequence.

So what are these structural variations? These are events which usually can be seen at the chromosome level under a microscope, although as detection ability has gone up, the definition has also been expanding to include smaller and smaller events as structural. People who have been working on structural variation for many years will say, well, you're finding this 300 base pair thing — that's not a structural variation event; only a thousand base pairs or more is structural; only 10,000 base pairs or more is structural. In reality, there's obviously a continuum of events from one base pair to millions, and there's really no single sharp cutoff you can point to. Basically, to me, structural variations are anything which you can't find just by mapping a read. You take a read, you map it, you can find a SNP; you can find a two base pair insertion, a four base pair insertion. But if you're dealing with a read of length 30, you can't find a 10 base pair insertion, because the read will map to tons of locations once you allow its two parts to be separated by a 10 base pair gap — the statistics are just against you.

Here are some examples of structural variations and how they can be detected with various kinds of experiments: either by looking at the chromosome directly, as with this inversion of part of chromosome nine, or using FISH, where you fluorescently label chromosomes and then take a look at them. So this is the common allele — green, yellow, red — but there is also an inversion where yellow and green are in the opposite order. And here is an amplification: for one chromosome you have one copy of a certain sequence, and for another chromosome you have two. It can also be something like 10 copies versus 12 copies of a sequence. These are copy number variations. In general, people talk about structural variation and people talk about copy number variation — or people talk about insertions and deletions versus copy number variation — as though they're two different things. Well, in reality that's not true.
There are copy-neutral events, like inversions or transpositions, where a piece of DNA just moves to a different location. But most events — insertions and deletions — are also copy number variations. Deletions trivially: if you delete a piece of DNA, that's a difference in copy number; there used to be one copy, now there is zero. If you insert a piece of DNA — well, the genome is pretty bad at inventing new sequences, so usually the insertion is copied from somewhere else. So it creates an extra copy of the sequence that was copied, the one that ended up inserted. The two are really two sides of the same thing: an insertion is really a CNV most of the time. So when I explicitly look at copy count, I will use the term CNV, and when I'm just looking at the fact that something happened, but I'm not sure what, I will call it structural variation — we'll say there's an insertion, but I'm not really sure what was inserted.

So — oh boy, okay, I'm going to blame this on Microsoft again; this should all be on one slide — what does an insertion look like? You have a donor genome — all of these examples will have a donor genome, which is the genome of some individual you're sequencing — and the reference genome, which is what's located at NCBI or UCSC and which you can download. When you compare them, there's an extra piece of DNA in the middle of the donor: that's an insertion. Similarly, a deletion is when the donor genome is actually missing a piece of DNA. Inversions: you take a piece of DNA and, remembering that DNA is double-stranded, you flip it and insert it back. So these are the common structural variations.

How do we actually find them? There are various available techniques, from wet lab to sequence analysis. One of the most popular is comparative genome hybridization, CGH. What you do is you basically make a microarray which will hybridize various pieces of DNA, you put on DNA from an individual and possibly from a pool of individuals, fluorescently label them, and see which one is more represented — which has more copies in the pool. So one spot will go red, another green, and yellow means more or less equal. That's one of the most popular methods. This will tell you things which have changed in copy number, as long as the change is relatively significant; it will not be able to tell you 10 copies versus 12 copies — that's just beyond the resolution of the microarray. Moreover, this also cannot detect inversions and translocations — these are copy-neutral variants, where there's actually no underlying change in copy number. There's also FISH, fluorescence in situ hybridization, which is basically an experiment where you fluorescently label DNA and actually look at it. Unfortunately, this is very time-consuming and expensive, and you have to really know what you're targeting — you can't just look genome-wide for variants.

There are also two sequence-based techniques. The best one, the one which is most effective, is direct comparison of genomes. If you're a really famous biologist, possibly you have your genome already assembled. But it's very expensive to assemble a whole genome; with the current technologies, as I pointed out in my previous lecture, you really can't put together a genome from scratch — you usually work off of a reference.
And whenever you do that — if you're just doing assembly off of a reference — it's very difficult to assemble areas with structural variation, where there are large-scale alterations between individuals. So you can do a direct comparison of, say, the Venter genome and the NCBI reference genome to find structural variants. Unfortunately, that's not going to be practical as we're sequencing thousands of genomes with short reads. However, there's an alternative technique, which is using unassembled mate-pair, or paired-end, data — and I'm going to be using the two terms completely interchangeably, because from my perspective, from the perspective of the algorithms, there is no difference. Using mate pairs to detect structural variations is much cheaper: you don't need to assemble the whole genome from scratch. And next-gen sequencing technologies make this even more attractive, because you can get really, really high paired-end coverage very, very cheaply.

So how exactly does this work — detecting structural variations with mate pairs? First — I guess I shouldn't have had this slide in here; you all already know what mate pairs are. You've got the two sequenced ends, there's some insert size, and you know nothing about what's in between. Now, the biologists in the room will say this is ridiculous, but let's assume for a moment that the insert size is perfect — that we know exactly the distance from here to here. What can we do with that? Well, it's pretty easy to detect structural variants then, because if you have two reads sampled from the donor genome and you map them to the reference, and the map distance — the distance between these two reads on the reference genome — is the same as your expected insert size, you know that nothing happened: no insertion or deletion events, at least, happened in between. However, if there was an insertion in the donor genome, then the distance at which they map will be significantly smaller. Furthermore, you can estimate the size of the insertion as the insert size minus the map distance. Okay? Pretty straightforward stuff. This, by the way, makes the assumption that there's only a single event in between the two reads of a mate pair. That's a somewhat reasonable assumption in most cases, because the events we're looking at are relatively rare, though we have seen cases where it's not true — where there are multiple events actually happening between the pairs.

Just wondering, what happens in the case of RNA data — say there's an intron in those kinds of cases? Okay, so here I'm talking strictly about DNA sequencing. Do people use similar types of approaches for RNA, for finding fused transcripts of various genes? Yeah, you can use similar approaches for alternative splicing, but in this lecture I'm really concentrating on DNA sequencing, not RNA sequencing. By the way, everyone, please ask questions — we have plenty of time to go through this at a reasonable clip, not just race through.
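To make the basic signal concrete, here is a minimal sketch of the insert-size arithmetic just described — estimating the size of a putative indel from one mate pair. Everything here (function name, numbers) is illustrative, not taken from any actual tool:

```python
def estimated_indel_size(expected_insert_size, map_distance):
    """Positive values suggest an insertion in the donor, negative a deletion,
    assuming a single event between the two reads of the mate pair."""
    return expected_insert_size - map_distance

# 200 bp expected insert, reads map 180 bp apart on the reference:
print(estimated_indel_size(200, 180))   #  20  -> roughly a 20 bp insertion
print(estimated_indel_size(200, 230))   # -30  -> roughly a 30 bp deletion
```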
So, the second thing is: let's say we now see a mate pair, and it supports some event — it claims that there's some insertion happening. Would you trust that mate pair, if that's the only evidence you have — one mate pair? Why not? It could be there by chance, either from bad mapping, or because there was actually an alteration of the DNA during sequencing — there could have been a recombination event during the sequencing process which wasn't in the original genome. So no, a single mate pair you definitely would not trust to support an event. It comes down to the fact that whenever you're looking for very rare events, you really want to be confident, because even small false positive rates blow up in your face. For example, say there is a rare disease which affects one out of a thousand people, and there's a test for it which has a 1% false positive rate. If you get tested and you come out positive, what's the probability that you actually have the disease? About 10%, because if you test a thousand people, roughly ten will test positive but only one will actually have the disease. So when you're looking for really rare events, you want to be very confident — even small fractions of errors will blow up in your face. What you really want is multiple mate pairs which support the same event.

This is captured by a concept called consistency of mate pairs. Two mate pairs that explain the same event are called consistent. For insertions, that means the size of the insertion implied by one mate pair is the same as the size implied by the other, and the two mate pairs overlap — there is actually some sequence in the middle where there's space for the insertion to have happened.

You can define the same kind of rules for inversions, although it becomes a little trickier. Imagine a mate pair which maps into an inverted region. Oops. This is how it will map to the reference — notice that the orientation will be wrong for the second read. But what about the distance between the reads? Well, multiple inversions could actually create mate pairs with the same map distance. Take a look at this: the inversion could be as small as this, roughly M minus the insert size — where M is the map distance, right here — or it could be as large as this, roughly M plus the insert size. All of these inversions could be explained by the same mate pair mapping. So what you get is a range estimate: the size of the inversion is somewhere between M minus the insert size and M plus the insert size. This is where the size of your insert becomes very important: obviously, if your insert size is very small, you can detect smaller inversions, and you can be more confident about your estimate of the size.

Furthermore, you can similarly define the concept of consistency here, where two mate pairs consistently explain the same inversion. This is a little bit hard to see, but I'm actually looking at the distances between reads of different pairs — this distance and this distance, where the reads come from different mate pairs. Those have to be the same for the two mate pairs to be consistent, and the range of inversion sizes explained by one mate pair has to overlap the range of inversion sizes explained by the other — meaning this. So inversions actually create a rainbow-like pattern: where in the case of insertions things overlap partially, like a chain, here it's a rainbow.
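Here is a small sketch of the inversion size range and the overlap part of the consistency check just described, assuming a perfect insert size; the full check also compares the read-to-read map distances between the two pairs. All names and numbers are illustrative:

```python
def inversion_size_range(map_distance, insert_size):
    """Range of inversion sizes consistent with one wrong-orientation mate pair:
    roughly M - insert_size up to M + insert_size, where M is the map distance."""
    return (max(0, map_distance - insert_size), map_distance + insert_size)

def ranges_overlap(a, b):
    """Two mate pairs can consistently explain the same inversion only if their
    implied size ranges overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

r1 = inversion_size_range(map_distance=5000, insert_size=500)   # (4500, 5500)
r2 = inversion_size_range(map_distance=5300, insert_size=500)   # (4800, 5800)
print(r1, r2, ranges_overlap(r1, r2))                           # overlapping -> consistent
```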
That's all good, but as we all know, mate pair insert sizes are not going to be perfect. There's actually going to be quite a wide deviation between the smallest insert you'll get and the largest. This is a histogram of observed mate pair map distances from a classic paper from 2005. What they did, with Sanger data, was look at these and say: things more than three standard deviations away from the mean are going to be structural variants, and things in the middle are not. And the question is, why are mate pairs which are right here any better than mate pairs which are right here? The answer is that there is really no good reason — it's just an arbitrary cutoff. Also, by doing this — the mean was around 40,000, and they went to about 32,000 on one side and 48,000 on the other — they were unable to detect any event smaller than about 8,000 base pairs. That's just the limit of the resolution. In their case they had a 40 kb mean and a 2.8 kb standard deviation.

One thing to notice is that the distribution is completely not Gaussian, and doesn't have any other nice shape or form. People sometimes pretend that distributions of insert sizes are Gaussian; that's pretty much never true. I haven't seen one which is Gaussian. I've seen some which look absolutely horrible — really, really fat tails, more or less uniform until they slowly drop off. I've never seen anything which even remotely resembled a Gaussian; this is about as close to Gaussian as you get.

Why is this all important? Well, what if you really want to detect smaller indel events — can you do it? In the case of Sanger data it turned out that you really couldn't: you were limited by that pretty hard cutoff; you couldn't go too much further in. But with high-throughput sequencing data, you may be able to. Why? Well, first of all, what if you have a small difference between the expected insert size of a mate pair and the map distance? This can be due to two different things. It could just be natural variation — this mate pair got sampled not from the exact mean of the distribution but a bit closer to one of the tails. Or it could actually be that there's an underlying small insertion. How can we tell which one it is? Does anybody have ideas? No? What data do we have? Mate pairs. So — if we have a lot of evidence, then we should be able to do better. That's the hope. In reality, especially with high-throughput sequencing data, you will have very high clone coverage, or at least relatively high clone coverage: for every single potential event, you'll have many mate pairs which span it. All of these get mapped to the reference, and this is what we define as a cluster — all of the mate pairs which span a certain location on the genome.

What happens if this area does not have an indel? You expect to see some observed distribution of insert sizes, and if this area in fact has no indel, the distribution you observe should be equal to the expected one — the whole distributions should match. What if there is a 20 base pair insertion? The whole distribution should shift. And if there's a 20 base pair deletion, the distribution will shift in the other direction.
So in reality you're looking not at a single mate pair at a time; you want to look at all of the mate pairs which could potentially explain variation in a given genomic region. Can we somehow quantify what power this gives us? It turns out that we can. There's this wonderful thing called the central limit theorem, which you probably learned in college and completely forgot since. If you have a whole bunch of independent random variables — things like coin tosses or dice rolls — each with mean μ and variance σ² (σ is the standard deviation), the mean of n of them will follow a Gaussian with mean μ and standard deviation σ divided by √n. And the cool thing is, it really doesn't matter what the underlying shape of the distribution you're sampling from is. It doesn't have to be Gaussian; it doesn't have to be any kind of nicely defined distribution — any distribution will do. If you're sampling from it, your standard deviation, which was σ for the original distribution, becomes σ over √n as the number of samples grows. So your standard deviation gets smaller and smaller as you get more and more samples.

And this is exactly what we have. Our random variables are the sizes of indels supported by each mate pair, and the mean of all of these is going to be — sorry — the mean of all of these is going to be an element randomly drawn from a Gaussian distribution with this mean and this standard deviation. So if there is some underlying indel size, we're going to get a sample from a distribution centered around it. We won't get the indel size precisely, but we'll get a very good estimate. Does this make sense? Some of you, hopefully.

Furthermore, this says that you can actually compute p-values. You can look at a proposed event and compute how likely it is to occur by chance: this is exactly the probability that I sampled the observed mean from a Gaussian which is centered at zero, indicating that there is no indel. In other words, it gives you how likely it is that I would observe this mean if there actually were no underlying indel.

That's a great question — let me get back to it in another five slides or so. The question was: how do I know that the mate pairs I'm getting actually span the same indel? It turns out the answer is that we do a lot of work to try all the possibilities.

Okay. So what is the key thing in all of this analysis that I have so far left out? This is all great, right? It gives you a nice probabilistic framework for detecting smaller events; you can get p-values; you can be very careful about your size estimate. But there's a little bit of biology that's completely left out. Sorry? Inversions — no, inversions we actually don't handle this way. Inversions you detect by looking specifically at mate pairs which map with the wrong orientation, on the other strand. This is just for insertions and deletions. Yeah? Well, it's not going to be constant, but — we're assuming that there's a single library, so there's a single insert-size distribution.
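Before moving on to the diploid case, here is a minimal sketch of the cluster test implied by the central limit theorem argument above: average the per-mate-pair deviations, divide sigma by root n, and ask how likely that mean would be if there were no indel. The function and the numbers are illustrative only:

```python
import math

def cluster_indel_test(deviations, sigma):
    """deviations: expected insert size minus observed map distance, one value
    per mate pair in the cluster.  sigma: standard deviation of the (single)
    insert-size library.  Returns the estimated indel size and a two-sided
    p-value for the null hypothesis that there is no indel (true mean = 0)."""
    n = len(deviations)
    mean = sum(deviations) / n
    sem = sigma / math.sqrt(n)                   # central limit theorem: sigma / sqrt(n)
    z = mean / sem
    p_value = math.erfc(abs(z) / math.sqrt(2))   # Gaussian two-sided tail probability
    return mean, p_value

# Ten mate pairs that each look roughly 20 bp "too short":
print(cluster_indel_test([18, 22, 19, 21, 20, 23, 17, 20, 19, 21], sigma=15))
```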
So, all of this is great, but in reality we have two copies of every chromosome, so the real-world situation actually looks like this: you have two chromosomes, of which potentially only one has the insertion or deletion. What I described before was really the haploid case. Here, a heterozygous insertion means you have two chromosomes which will both map, and they'll give you a cluster that mixes two different effective insert sizes — hopefully one of which matches the reference and one does not. You could even have a triallelic case where neither matches the reference. So yes, there's the observed distribution of map distances from donor chromosome 1, which might be equal to the expected insert size distribution, but there's also a distribution of map distances from donor chromosome 2. What you're really seeing is a mixture of two distributions, where the size of the insertion is roughly the distance between the peaks. So what you really need are these underlying means. Unfortunately, you have no way of knowing which mate pair came from which chromosome. How can you do this?

Well, it turns out you can do it by applying a little bit of smart computer science — something called the expectation maximization algorithm. Really this is just a bootstrapping procedure which tries to estimate which mate pair came from which distribution. You start with two random distributions — maybe they look like this — and for every single mate pair you assign it to the two distributions with the probability that it was generated from each of them. Things which are on the left are much more likely to have been generated from this distribution than from that one; for things on the right, it's the reverse; and things in the middle could have been generated from either, with some probability. You do this assignment, and now you have a whole bunch of blue mate pairs assigned to the distribution with mean μ2, and a whole bunch of red ones assigned to the first one. And now you update: you move the distributions so that they best match the mate pairs assigned to them — so you'd shift this distribution over this way. And then you iterate this process: assign the mate pairs to the distributions, fit the distributions to the mate pairs, and so on. This is a classic approach in computer science called expectation maximization. To do the update we use something called the Kolmogorov-Smirnov statistic, which is basically a way of measuring distances between distributions; if you're really into this, I can explain offline why we use Kolmogorov-Smirnov rather than something else — there isn't a deep reason to prefer this particular statistic over the alternatives.

Do you assume the distributions are Gaussian? Nope — you do not assume anything like that; you assume that you know the shape of the distribution, but it's an empirical distribution. We know the shape of the distribution and we know its standard deviation; we don't know where the shift is. So we have some distribution which looks like this, and then a second distribution which is the exact same one but shifted, like this. At any single point — say we have an element which is sampled from here — you can say, well, this curve has height 2 here and this one has height 50, so with probability 2 out of 52 it came from this one and with probability 50 out of 52 from the other. So yes, the distribution shape is known: it's the empirically observed distribution of the insert size. Sorry? Yes — we assume the sampling is with replacement.
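Here is a toy version of the EM idea just described. For brevity it models the two haplotypes as a 50/50 mixture of two Gaussians with known sigma — the actual method works with the full empirical insert-size distribution and a Kolmogorov-Smirnov-style fit — so treat this as a sketch of the E-step/M-step loop, not the real algorithm:

```python
import math, random

def em_two_shifts(deviations, sigma, iterations=50):
    """Estimate the two haplotype shifts (mu1, mu2) behind a cluster of
    per-mate-pair deviations, assuming a 50/50 mixture of two Gaussians
    with known sigma (a simplification of the empirical-distribution fit
    described in the lecture)."""
    mu1, mu2 = min(deviations), max(deviations)        # crude initialization
    for _ in range(iterations):
        # E-step: probability that each mate pair came from component 1
        resp = []
        for x in deviations:
            p1 = math.exp(-((x - mu1) ** 2) / (2 * sigma ** 2))
            p2 = math.exp(-((x - mu2) ** 2) / (2 * sigma ** 2))
            resp.append(p1 / (p1 + p2))
        # M-step: refit each mean to the mate pairs softly assigned to it
        w1 = sum(resp)
        w2 = len(deviations) - w1
        mu1 = sum(r * x for r, x in zip(resp, deviations)) / w1
        mu2 = sum((1 - r) * x for r, x in zip(resp, deviations)) / w2
    return sorted([mu1, mu2])

# Simulated heterozygous 30 bp insertion: half the pairs deviate by ~0, half by ~30
obs = [random.gauss(0, 10) for _ in range(30)] + [random.gauss(30, 10) for _ in range(30)]
print(em_two_shifts(obs, sigma=10))   # should recover means near 0 and 30
```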
We can make that assumption because we essentially think of there being an infinite supply of DNA. That's not completely true, but it works for this procedure. And once you see how we actually run this, you will see why it's computationally intensive: each individual step is not very intensive, but we do it over whole genomes. So basically, the key is that we assume we know the shape of the distribution, but we don't assume it has any kind of normal closed form or anything like that — it's some empirical observed distribution, which we just get by binning things.

Taking observed data and separating it into distributions is actually the most algorithmically complex portion of this work, and if you're really interested in it, we'll be glad to provide you with our code — the code for this is publicly available. In general, there's a whole statistical theory for the situation where you have observed data which could be coming from a mixture of several distributions and you want to pull them apart. It was originally developed for mixtures of Gaussians, but you can generalize it to any kind of distribution.

So here is the method we developed, called MoDIL, which stands for Mixture of Distributions Indel Locator. This is how it works. You start by mapping the reads — and we are very agnostic about which mapping method you use. Then — and this is where I can answer the earlier questions about both the computational intensiveness and how you make sure that you're looking at every event — here's how we define the clusters. We have a whole bunch of mate pairs which have been mapped, and we start a cluster at the end of every single first read of a mate pair: this is a cluster, and this is a cluster, and this is a cluster. We run this mixture-of-distributions EM procedure for every single one of them, and this is what actually gets computationally intensive — it's not any individual run, it's the fact that we do this basically as many times as you have mate pairs. We also have a framework to take mate pairs which map to non-unique locations and assign them to unique ones, but I really don't want to get into that. Then we run EM for each cluster. At the end we do some post-processing: we compute p-values for every single cluster and apply a global multiple-testing correction — we usually use the false discovery rate — and we merge duplicates, because with the way we did the clustering, every single indel will be predicted multiple times at multiple adjacent locations, so we go through and merge these back together. And for every single event we also compute the probability that it is heterozygous. I didn't go into this, but you can do it based on the observed distributions: you basically look at the two distributions and ask, can I really be confident that they're separate, or is it likely that they're actually the same?
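Two of the post-processing steps just mentioned — the false discovery rate correction and the merging of duplicate calls at adjacent positions — are standard enough to sketch. This is a generic Benjamini-Hochberg procedure and a naive position-based merge, not the actual MoDIL code:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of clusters that pass a Benjamini-Hochberg FDR cutoff."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    m = len(p_values)
    last_passing = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            last_passing = rank
    return {order[r] for r in range(last_passing)}

def merge_adjacent_calls(calls, max_gap=50):
    """Merge indel calls made at nearby positions (the same event predicted by
    several overlapping clusters) into one call with the averaged size."""
    merged = []
    for pos, size in sorted(calls):
        if merged and pos - merged[-1][0] <= max_gap:
            last_pos, last_size, n = merged[-1]
            merged[-1] = (pos, (last_size * n + size) / (n + 1), n + 1)
        else:
            merged.append((pos, size, 1))
    return [(pos, size) for pos, size, _ in merged]

print(benjamini_hochberg([0.001, 0.04, 0.9, 0.0005]))            # e.g. {0, 3}
print(merge_adjacent_calls([(100, 20), (130, 22), (5000, -40)])) # two merged calls
```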
How well does this work? To start off with, it's always nice to do a little bit of simulation to make sure that you're doing something reasonable. There was a study by Mills et al. a few years ago which took pretty much all of the available Sanger data from every individual that had been sequenced at that point, mapped it to the human reference genome, and found indels which were supported by at least two reads from some individual. So this is a really good set of real indels. If you just take a real human chromosome and start putting random indels into it — randomly inserting or deleting letters — you're going to get a much easier case than the real thing, because in reality many indels happen in already repetitive regions. It becomes easier to find them if you insert them at random than if you take real ones. So we took the real, known indels and inserted them into human chromosome one, half heterozygous and half homozygous, generated 51 million mate pairs with the expected distribution of insert sizes, and looked at how well our method does at finding them.

We measured two things, called precision and recall. You can think of precision as something like specificity and recall as something like sensitivity — in reality they're defined slightly differently — but basically, for insertions and deletions greater than 20 base pairs, we were at 85 to 90 percent on both counts: we could find 85 to 90 percent of the true indels, and about 90 percent of the things we found were true. And it's interesting to see that the numbers drop as you go to smaller and smaller indels, because you can no longer be nearly as confident that the distribution is distinct from zero — from the null hypothesis. Yeah — so what are the size ranges here? Whenever we talk about these, we define insertions and deletions relative to the donor: an insertion is sequence which is present in the donor and not present in the reference. Deletions go from 20 base pairs up to anything, and insertions go from 20 base pairs up to the insert size — because if the insertion is greater than the insert size, you will never be able to span it, and hence you will not be able to see it when you map to the reference. If the insertion is a duplication — wait until the next section of the lecture.
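For reference, here is the arithmetic behind the precision and recall numbers quoted above, using exact matching of calls for simplicity; a real evaluation would allow approximate matches of position and size. The data in the example are made up:

```python
def precision_recall(predicted_calls, true_indels):
    """Precision: what fraction of the calls are real.
    Recall: what fraction of the real indels were called."""
    predicted_calls, true_indels = set(predicted_calls), set(true_indels)
    true_positives = len(predicted_calls & true_indels)
    precision = true_positives / len(predicted_calls) if predicted_calls else 0.0
    recall = true_positives / len(true_indels) if true_indels else 0.0
    return precision, recall

truth = {(1200, 25), (56000, -30), (91000, 40)}   # (position, size) pairs
calls = {(1200, 25), (56000, -30), (70000, 15)}
print(precision_recall(calls, truth))             # (0.666..., 0.666...)
```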
So — 51 million mate pairs didn't take very long; we did all of this on the cluster, so it was just a few hours. We did this for the whole — wait, I'll answer that question in a second. So the performance drops off, and basically below 10 base pairs the method can't find anything. We also did a comparison with another method which had just been published for structural variation detection, which looks at just the mate pairs, and it turns out that we were exactly the same for indels greater than or equal to 40 base pairs, but between 20 and 40 they found absolutely nothing, while we had pretty much the same performance down to 20 base pair insertions and deletions — for them, 40 was basically a hard cutoff. And, you know, with MAQ and other SNP tools like that — at least with MAQ; I don't know about MOSAIK — we haven't been able to find anything greater than 10 base pairs. This is with 35 base pair reads. I think the two sides are really going to merge — they're already merging — because as the reads get longer you can find longer and longer indels just by mapping, and the mate pair methods are taking care of the rest. The two approaches probably already have a real overlap: with 72 base pair reads you can probably get to 20-plus base pair indels just by mapping, and methods like MoDIL can get down to below 20.

We ran this on NA18507, which is a human individual that was published in the Bentley paper a year or so ago; all of the high-throughput sequencing papers now use it as one of the standard benchmark data sets. There's 40x Illumina read coverage, and the mate pairs were 208 base pairs, plus or minus 13 standard deviation. We ran this through MoDIL, and it took something like three days on our 200-core cluster — so, a while. It's a computationally intensive process, and there's not much you can do about it — oh, never mind, there is a lot you can do about it: my guess is that we'll be able to speed it up about ten-fold. The reason is that right now it's all written in Python, and for those of you who don't appreciate Python, it's probably the slowest language known to mankind. So we're right now porting it to something called Cython, which is an optimized version of Python, and expecting something like a ten-fold speedup.

For this individual there was a previous paper which found a small fraction of the indels. Basically, they were doing Sanger paired-end sequencing in order to find big structural variation, but in the process, by mapping the Sanger reads, they could also find small indels right there — they were looking for big things, but also got the smaller ones where the reads mapped. The coverage is very small, though — about 0.3x coverage of the genome — so you expect to see only a small fraction of the true indels; on the other hand, you do have some data. What we looked at — well, great, this one is just not showing up at all — is that for indels greater than or equal to 20 base pairs, we found 95% of the indels which were in the Kidd data set. It dropped to about 70% for 15 to 20 base pair indels, and below that — we don't show it, but we miss about 70%. So again: greater than 20 base pairs is great, 15 to 20 is okay, and it sucks beyond that point. That's what you get.

Another thing which may be interesting: we don't actually observe the indel size directly — we just get an estimate that something happened somewhere in there and it's approximately this big. So how accurately can we predict the size? What we did is take the insertions which we found and compare them to the overlapping ones in the Mills data set — indels found in different individuals. When I saw this correlation plot, I sort of wished that all of them looked like this: this is the predicted MoDIL indel size versus the observed Mills indel size for overlapping indels, and this is about as good as they get — the correlation score is 0.96. At the same time, you can say, well, there is a bit of variation; you're not getting it perfectly spot on. But there is a reason for that, and it goes back to the central limit theorem: remember, we're not getting the true indel size, we're getting an estimate of the indel size, which should be Gaussian distributed around the true indel size. So let's compare — subtract one from the other. This is what you get: the gray bars, and in black is the Gaussian distribution with the appropriate standard deviation. The two are basically identical, so the math really works. When I saw this I was like, yes, this is spectacular — and half my group are former theory students, and they were like, why are you surprised? It's math. So, I don't know. Okay. And finally, I want to get into copy number variants.
The slides for this part were initially put into a different file in your folder, so it should really be this one. So, copy number variants — the way I define them — are regions that appear a different number of times in different individuals. For example, it could be a duplication, where an element got, say, retro-transposed into a different location in the genome, or it could be a deletion, where an area is missing in the other genome. In this case we're mainly estimating the copy number, and not so much the actual sequence of events that happened — we just want the count. This is important for things like figuring out the dosage of genes: if a gene has a higher copy number, possibly there's a higher dose of it actually getting transcribed. So these are potential copy number variants, and in general CNVs have been associated with diseases — schizophrenia, cancer, other kinds of psychiatric conditions.

In this case, the input is a reference human genome and the sequenced donor genome, with just paired-end data coming from the donor, and what we want to produce is copy number variants annotated on the reference — namely, regions of the reference whose copy number differs in the donor. And here, again, we're going to use a little bit of computer science, but computer science which you guys already know. We start by building something called the repeat graph. The way the repeat graph works is that you take regions of the genome which are similar to each other and you merge them together into single nodes. So here you have red and blue, which are similar to something else, and green, which is similar to something else over here which you can't see. You initially build a graph where every single one of these elements is a separate node, and then you take the two red ones and merge them into a single node. This is something called a repeat graph — does this look at all familiar to you?
This is basically the de Bruijn graph from the last lecture, except we no longer split things into k-mers — we just look at maximal matching pieces. Otherwise it's pretty much identical to the de Bruijn graph. The cool thing about the repeat graph is that a walk on the repeat graph should spell the original genome: you start here, go into the red node, then into the blue node, this black area, then back into the red, into this black, green, and out. So if you build a repeat graph, you can always take the original genome and find it as a walk on this repeat graph.

However, what we really care about is not the reference genome — we build this graph from the reference — we care about the donor genome, the genome which we actually sequenced, and how it's different from the reference. Imagine that the donor genome has an extra copy of this blue element right here, right before the green one. What evidence do we have that it's different? Well, we've got mate pairs. There will be mate pairs which go from this blue region to this green one, and when we look at them from the perspective of the reference, these will be out-of-whack mate pairs: they will map way too far apart. Moreover, if you consider a cluster — all of the mate pairs which span this breakpoint — you're actually going to get a whole bunch of them. This is a signature that in the donor genome this blue segment is followed by the green one; the cluster is indicative of that. So you can take your original graph and add an edge from the blue node to the green one. Similarly, you can do the same thing on the opposite side of the blue segment: you'll find this cluster and you'll realize that from this black region you can go to the beginning of the blue one — from here back to here. This is something called a linking signature: it's basically using mate pairs to say you can go from here to over there. And this goes back to the question which I was asked earlier — what if the insertion is actually present somewhere else? You get the matching linking signatures going into the inserted area and back out.

Yeah, it's a very good question. For this, we actually put the edges in every single place, because it's better to have a false edge than to miss a true one. We call these things donor edges — edges which are not in the original repeat graph but which we infer from the donor. Having extra ones doesn't hurt us nearly as much as missing ones, so we place them in every single location there is, and in the next stage, which I'm going to talk about, we correct for the fact that we're going to get over-representation — that will be an important consideration. So we capture the donor adjacencies and build this thing called the donor graph, which is basically the repeat graph with a whole bunch of extra edges indicating adjacencies in the donor. The cool thing about this graph is that not only is the reference a walk on it, just as it was on the original one — you just ignore the donor edges — but the donor genome can now also be generated by walking this graph from start to finish. That walk would correspond to the donor genome. We can't actually reconstruct the donor genome — that would be de novo assembly — but what we can look at is how many times we expect the walk to go through each node, which is an easier problem. These will give us our copy number variants.
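Here is a small sketch of the linking-signature idea just described: take the adjacencies spelled by the reference, and add a donor edge wherever a well-supported cluster of discordant mate pairs says the donor walk can jump from one node to another. The graph representation, the support threshold, and the node names are all illustrative:

```python
from collections import defaultdict

def build_donor_graph(reference_adjacencies, discordant_clusters, min_support=3):
    """reference_adjacencies: (node_a, node_b) pairs spelled out by walking the
    reference through the repeat graph.  discordant_clusters: {(node_a, node_b):
    number of supporting mate pairs} for clusters whose reads map too far apart
    (or in the wrong order) on the reference.  Well-supported clusters become
    donor edges -- linking signatures saying the donor walk can go a -> b."""
    graph = defaultdict(set)
    for a, b in reference_adjacencies:
        graph[a].add(b)
    for (a, b), support in discordant_clusters.items():
        if support >= min_support:        # a single mate pair is not trusted
            graph[a].add(b)               # donor edge inferred from the cluster
    return graph

# Reference spells ...black1 -> blue -> black2... and ...black3 -> green...
# A cluster of 8 mate pairs linking blue to green suggests an extra blue copy
# right before green in the donor; the single blue -> black3 pair is ignored.
ref = [("black1", "blue"), ("blue", "black2"), ("black3", "green")]
donor = build_donor_graph(ref, {("blue", "green"): 8, ("blue", "black3"): 1})
print(sorted(donor["blue"]))   # ['black2', 'green']
```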
In order to do this, we rely on something called depth of coverage. This is related to the fact that areas of the genome which are present many times in the donor will be sampled many times: if they're present more times in the donor, they will be sampled proportionally more often. How much more? Well, consider a segment of the genome — well, everything is missing from this slide now; this one I am definitely not blaming on Microsoft, it has absolutely nothing to do with it. There were a whole bunch of formulas here which are now completely gone; I'm sure you're all glad to hear that. But basically, what it comes down to is that for a segment of the genome you can try to estimate how many times it is actually present in the donor by seeing how many reads map to it: if it's present twice as many times in the donor, you should see twice as many reads. In practice you can't get the number exactly, but you can get a probability distribution which tells you how likely it is that this segment was present in the donor one time, two times, three, four, five, and so on. This probability distribution is a Poisson, because the reads are described by a Poisson arrival process. I wish I could scare you with a formula — unfortunately Michelle, probably on purpose, filtered these formulas out.

So when we actually want to call the CNVs, we want to find the walk which best matches these distributions: when we go through this node, or this edge, we want to go through it, hopefully, two times — we could tolerate one or three, but those are less likely, so we prefer to go through it twice. So what do we have now? We have this graph; we know for every single node how many times it was present in the reference genome; and we have these arrival rates based on the depth of coverage. We may get something like: this node should be traversed 0.8 times, this one 2.3 times, this one 2.6 times — numbers which are not integral and may not actually correspond to any particular walk. But it turns out that we can find the walk through the donor graph which is most faithful to the depth of coverage — which most closely matches these numbers — using something called network flow with convex costs. You basically get a single walk (there could be multiple walks with the same traversal counts), and then you look at how many times it goes through each node: this one it went through once, these two it went through twice, these two it went through once, and here it was present once in the reference but twice in the donor — so this blue area is a copy number variant. So there's a probabilistic model to score faithfulness — it's maximum likelihood — and there's a network flow to find the most likely walk. It's elegant computer science; you don't need to worry about the details.
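Here is a minimal sketch of the depth-of-coverage scoring for a single segment under the Poisson arrival model just mentioned. The actual method combines such scores over all segments and finds the best overall walk with the convex-cost network flow, which is not shown here; the function name and numbers are illustrative:

```python
import math

def copy_number_log_likelihoods(reads_observed, reads_per_copy, max_copies=5):
    """Log-likelihood of each copy count for one segment, treating read starts
    as a Poisson arrival process: with c copies in the donor, the expected
    number of reads is c * reads_per_copy."""
    scores = {}
    for c in range(1, max_copies + 1):
        lam = c * reads_per_copy
        # Poisson log pmf: k*log(lam) - lam - log(k!)
        scores[c] = (reads_observed * math.log(lam) - lam
                     - math.lgamma(reads_observed + 1))
    return scores

# A segment expected to attract ~100 reads per copy actually got 210 reads:
scores = copy_number_log_likelihoods(210, reads_per_copy=100)
print(max(scores, key=scores.get))   # 2 -- two copies in the donor is most likely
```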
This actually turns out to work very well. We ran it on the exact same genome and got about 10,000 CNV calls, slightly more losses than gains, and we compared them to two previous sets of results. For example, Kidd et al. published insertions and deletions, and things which they called deletions are almost completely covered by losses predicted by our algorithm. They actually had pretty big calls, so sometimes we have some gains overlapping their calls as well — I don't know if these are edge effects or why exactly this is happening — but for almost everything they called a deletion, we had an overlapping loss. Just to give you an idea: by shuffling our calls — taking our calls and moving them to random places in the genome — we get nothing close to that kind of coverage, so it's a very significant similarity.

[The lecturer's phone goes off.] We also compared the results to the Database of Genomic Variants, which is curated actually in this physical building, at the Sick Kids hospital, in Steve Scherer's lab. Basically almost everything we find overlaps with either a loss or a gain in the Database of Genomic Variants. The problem is that we can't really tell losses from gains, because they define losses and gains relative to a pool of individuals — well, it depends on the study, but usually relative to a pool — while we're calling them relative to a reference genome rather than a pool. And again, after shuffling — the Database of Genomic Variants actually covers a relatively large fraction of the genome, so we get some overlap between randomly shuffled calls and what we observed, but it's nowhere near the overlap that we actually got. Okay, I should have done this earlier. We also did a comparison to a third study, but I won't dwell on it.

So, take-home points from all of this. MoDIL, by taking advantage of high clone coverage, can find progressively smaller indels with high accuracy, including 90% or better accuracy for indels greater than 20 base pairs; we require about 20x clone coverage of the data set. And you can combine paired-end and arrival rate (depth of coverage) information to find copy number variants, and again you see good concordance with previous results. But really the single take-home message from all of this is that for finding large-scale structural variation, the key is paired-end, or mate pair, data. These are key to finding the events, and their accuracy is key to how well you can detect them. The length and the distribution of insert sizes is extremely important: if it's too small, you cannot catch some of the events; if it's too large, you're likely to have a very large variance, which will prevent you from finding smaller events even with an approach like MoDIL — because there are two elements there: the accuracy is defined by sigma divided by root n, so if you get more mate pairs you get progressively more accuracy, but at the same time a higher deviation works against you. And in all of this, read length is not nearly as important. I've actually never looked at the underlying alignments; I just care that I can take a read and map it to the genome. So 36 base pair reads are perfectly fine for these studies. With 72 base pair reads you can capture slightly bigger indels just by mapping, but you really don't need to go to 200 base pair reads to find most of the events here — although one thing that we don't do very well is finding the exact breakpoints of variants, and for that longer reads would be very useful.

Okay, so I want to acknowledge a whole bunch of people who worked on the things I talked about both today and yesterday. Seunghak Lee and Can Alkan worked on the structural variation side of things. Paul Medvedev worked on assembly for a while in my lab and has been working on copy number variation together with Marc Fiume, Tim Smith, and Adrian Dalca. Tim you are going to meet relatively soon — he's going to be here to help run the lab on assembly, because he is the author of IDIR, our color space assembly tool, which you guys are going to play around with soon. Okay, that's it.
I'd be happy to take questions.