Okay, so this will be a shorter stretch of me just standing up here talking, just about 15 slides. What I'm going to talk about is how we can find rearrangements using the read alignment methods that we explored in the first part of the morning.
So there are a lot of different categories of variation. There are single nucleotide variants, which are just single bases that differ from the reference. There are short indels, which we usually define as indels that are shorter than the read length; these are pieces of sequence that have been inserted into the reference or deleted from the reference. And then there are larger structural variations, like large insertions and deletions, pieces of the reference that have been inverted, translocations where whole arms of chromosomes could be swapped, and also copy number variation where some sequence has been amplified or deleted. In this module, we're just going to deal with structural variation, in particular large insertions and deletions, inversions and translocations. Subsequent modules, like tomorrow's, will cover single nucleotide variants and copy number variation.
So there are two different ways of doing paired sequencing. The method that I described in the morning is what we refer to as paired-end sequencing, where we take fragments of DNA which are fairly short, from about 200 to 500 bases. We add sequencing adapters to each end of the DNA fragment, and then we sequence inwards from the ends. We get a pair where one read is mapped to the forward strand and one read is mapped to the reverse strand. We saw that this morning. Now, this gives us quite short-range information, and there's a method called mate pairs which allows us to sequence pairs that are much further apart. In this method, we take longer fragments of DNA, which are multiple kilobases, say 10 kilobases in length.
We make a circle from the DNA, and then we shear the circle randomly, pull down the piece that has the adapter in it, and sequence the ends of those fragments, and that gives us much longer-range information. So if we sequence the ends of this fragment here, where the adapter sequence that we used for the pull-down is in orange, those ends correspond to the very far ends of the circle of DNA that we started with. Now instead of having local information, which is 200 to 500 bases, we have pairs that are much further apart on our reference genome, and that makes it easier to map and easier to find very large structural variation. But this requires extra library prep and extra sequencing, so the paired-end version, where we just sequence short fragments of around 200 to 500 bases, is much more common. So I'll be describing these rearrangement-finding algorithms in terms of just normal Illumina paired ends, where we have one read pointing forward and one read pointing reverse.
So this just describes the expected orientation of our pairs. As I just said, one read is on the forward strand and one read is on the reverse strand. There's some sequence in the middle that we didn't sequence and don't know about, and there's an expected insert size. Some people asked in the last part of the module about BWA's flag for what a proper pair is. What BWA does when it maps all of these pairs to the reference genome is sample alignments to try to learn what the distribution of fragment sizes is. Usually this is a normal distribution, or at least it can be approximated with a normal distribution. So it will try to figure out what the shape of this distribution is and what the mean insert size, or fragment size, is for the library. And it uses this information to determine whether the pairs are as expected when mapped to the reference genome.
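As a rough illustration of that idea, here is a minimal Python sketch that estimates the fragment-size distribution from a sample of template lengths and then checks whether a pair falls inside it. This is not BWA's actual procedure. The function names, the plain mean and standard deviation estimate, and the default cutoff of three standard deviations are all assumptions made for illustration.

```python
import statistics


def estimate_insert_size(template_lengths):
    """Estimate (mean, sd) of fragment size from a sample of template lengths.

    `template_lengths` is assumed to hold signed template lengths such as the
    TLEN column of a SAM file. Zero means "not available" and is skipped.
    """
    sizes = [abs(t) for t in template_lengths if t != 0]
    return statistics.mean(sizes), statistics.stdev(sizes)


def has_expected_size(template_length, mean, sd, n_sd=3.0):
    """True if the fragment size lies within mean +/- n_sd standard deviations.

    The cutoff n_sd is an illustrative assumption, not a value taken from BWA.
    """
    return abs(abs(template_length) - mean) <= n_sd * sd


# Hypothetical sample of template lengths from one library.
mean, sd = estimate_insert_size([270, 300, 330, -300, 0])
print(mean, round(sd, 2))
print(has_expected_size(350, mean, sd), has_expected_size(5300, mean, sd))
```

A real aligner would also need to guard against outliers in the sample, since a handful of pairs spanning large rearrangements can inflate a plain standard deviation.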
So if you imagine we've calculated this distribution, which has a mean of 300 bases and a standard deviation of, say, 30 base pairs, then any pair whose insert size is outside of, say, 300 plus or minus three standard deviations is mapping abnormally. We call those discordant read pairs, and these discordant read pairs are the ones that give evidence of structural variation. So now what I'm going to do is just go through some different examples of structural variation that we can find, and what the insert size and the read pair orientation tell us about each one.
So the easiest ones to find are deletions. I'm going to be using diagrams like this, where we have the donor. This is the individual that we've sequenced; it might be the cancer genome. And then we have the reference on the bottom. Here the reference has this large red block, which is deleted and not in the donor, so I've denoted the deletion as just these dashed lines here. Now if we sequence the donor, we're going to sample read pairs that span across this junction, the breakpoint where the deletion is. If our paired-end library is around 300 bases, we'd expect that on average the distance from this end of the pair to this end of the pair is about 300 bases. Now when we map pairs that have been sampled from the donor to the reference genome, because BWA has to account for the fact that there's this sequence that was deleted, it puts the pairs much further apart on the reference. So if you look at the insert size field in the SAM file, it will be 300 plus the size of the deletion on average. What we can do is scan along the genome looking for pairs that are mapped much further apart than expected, and this gives us a signal of a possible deletion.
Now the opposite event is an insertion. Here this red box represents some sequence that has been inserted into the donor that's not in the reference.
And here BWA is going to map the pairs much closer together, because the reference doesn't contain the inserted sequence. So if the distance from here to here is 300 again, then the distance mapped to the reference is going to be 300 minus the length of the inserted sequence. Now insertions are harder to find, because if you imagine a 10,000 base pair insertion, you're not going to have pairs that span from one end of the insertion to the other. So it's generally easier to find deletions using these sorts of methods than it is to find insertions.
Another type of structural variation is a tandem duplication. Here there's a sequence in the donor that's been copied: it's been duplicated at the same location, so just the same sequence has been copied once. If we have pairs that are sampled from the end of the first copy, span the breakpoint and go into the start of the second copy, then when we map these pairs to the reference genome, this read is going to align here. But because there's no second copy in the reference, this other read has to be mapped back to the start of the duplicated sequence. What happens is that the orientation of the pair is now incorrect. Instead of the reads pointing at each other, they point away from each other. So this is another signature of structural variation that we can use to find these types of events.
Likewise, if there's an inversion, we have a unique signature. Here in the donor, we have this blue segment that was flipped, so it's now on the opposite strand. If we sample a pair from the start of the inversion, then when BWA goes to align it, it has to put this read here on the opposite strand of the reference, and now the reads point in the same direction. So that's the signature of an inversion.
So this is just a summary of the different types of structural variation and the orientation we expect.
So for insertions and deletions, we expect the normal orientation of pairs, but the mapping distance is different. For inversions, we expect the reads to point in the same direction, and the insert size will typically be quite different than expected. For tandem duplications, we expect the reads to point away from each other, and again the insert size will probably be quite different than expected. For interchromosomal translocations, we expect one read to be mapped to one chromosome and the other read to be mapped to a different one, and the orientation can really be anything in these cases.
As I said before, insertions are a particularly tough case because we will rarely have pairs spanning across the insertion unless it's quite small, less than our fragment size. What can happen is that we might only find pairs that were sampled from within the inserted sequence, which isn't part of the reference, and these won't align to the reference, so the event might not be detected in this way. Here we'd probably want to use different methods, like a de novo assembler, which would try to assemble the inserted sequence, and then we could map that. These are more experimental methods that are still in development.
Another signature, which doesn't use pair orientation, is split reads. As I mentioned before, BWA will try to introduce small gaps when it aligns the reads. If a deletion is small enough, it might align one part of the read before the breakpoint and the other part of the read after the breakpoint, and we end up with a read with a long gap in it, which might also indicate that there's a deletion here. Most structural variation callers will use split-read signatures together with paired-end orientation and insert sizes to jointly look for evidence of structural variation.
So when we're looking for rearrangements in cancer, there are really two different approaches that we can take.
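Before moving on to the cancer case, the summary of pair signatures above can be written down as a small Python sketch. This is only an illustration of the rules as stated in the talk, not how any particular caller works. The function name, the returned labels, the read length of 100, and the default mean of 300, standard deviation of 30 and three standard deviation cutoff are assumptions. The labels are candidate signatures for a single pair, not variant calls.

```python
def classify_pair(chrom1, pos1, reverse1, chrom2, pos2, reverse2,
                  read_len=100, mean=300.0, sd=30.0, n_sd=3.0):
    """Label one read pair with the signature it is consistent with.

    Positions are assumed to be leftmost mapped coordinates, and `reverse*`
    is True when that read maps to the reverse strand.
    """
    if chrom1 != chrom2:
        return "translocation"

    # Order the two reads by position on the reference.
    (left_pos, left_rev), (right_pos, right_rev) = sorted(
        [(pos1, reverse1), (pos2, reverse2)])

    if left_rev == right_rev:
        # Both reads point in the same direction.
        return "inversion"
    if left_rev and not right_rev:
        # Reads point away from each other.
        return "tandem_duplication"

    # Normal orientation (forward then reverse): look at the mapped span.
    span = right_pos + read_len - left_pos
    if span > mean + n_sd * sd:
        return "deletion"
    if span < mean - n_sd * sd:
        return "insertion"
    return "concordant"


print(classify_pair("chr1", 1000, False, "chr1", 6200, True))
print(classify_pair("chr1", 1000, True, "chr1", 3000, False))
```

The sketch ignores the insert size for inversions and tandem duplications, and it ignores split reads and mapping quality, all of which a real caller would take into account.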
Usually you've sequenced both the tumor and the individual's normal genome, and we're not as interested in all the structural variation that's inherent to the germline genome. We're more interested in the structural variation that's occurred in the tumor. So there are two approaches to doing this. First, we can do structural variation calling in the tumor sample and the normal sample independently and then filter out all of the events in the tumor that are also in the normal. Or we can just find structural variation in the tumor sample and remove any events that have evidence, like these discordant read pairs, in the germline sample. Most programs that you'll download and run for structural variation calling in cancer follow approach two, where they jointly look at the tumor sample and the normal sample at the same time, as a stronger way of filtering out germline events.
Now, I think new for 2014 is that there's a gene fusion module, which is going to be covered in the afternoon, so I'm just going to skip this. But this is a type of rearrangement where a translocation has joined the exons of two different genes, and you get a chimeric protein that is possibly functionally relevant in cancer. Andrew McPherson will talk about these types of events in much more detail this afternoon.
Now, structural variation calling is quite difficult, and the methods and software used are under constant development. In the practical session, which we're going to start in just a few minutes, I chose a tool named Hydra from Aaron Quinlan's lab, primarily because it's very easy to run and it gives you a good overview of the steps required to pre-process the data and run a structural variation caller. This isn't an endorsement that it's the best structural variation caller, and the state of the art in the field is changing quite rapidly.
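Going back to the tumor and normal filtering described a moment ago, here is a minimal Python sketch of the second approach: keep a tumor event only if the normal sample shows no overlapping evidence of the same type. The tuple layout `(chrom, start, end, sv_type)`, the function names and the simple overlap test are assumptions for illustration. This is not how Hydra or any other specific tool does it.

```python
def overlaps(a, b):
    """True if two (chrom, start, end, ...) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def somatic_candidates(tumor_calls, normal_evidence):
    """Drop tumor calls that overlap same-type evidence in the normal sample.

    Both inputs are assumed to be lists of (chrom, start, end, sv_type)
    tuples. `normal_evidence` may hold weak evidence such as clusters of
    discordant pairs, not only confident calls.
    """
    return [
        t for t in tumor_calls
        if not any(t[3] == n[3] and overlaps(t, n) for n in normal_evidence)
    ]


# Hypothetical events: the chr1 deletion also has support in the normal.
tumor = [("chr1", 100, 500, "DEL"), ("chr2", 1000, 2000, "INV")]
normal = [("chr1", 150, 450, "DEL")]
print(somatic_candidates(tumor, normal))
```

Using weak evidence from the normal, and not only confident calls, is what makes this kind of filter stricter about germline events.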
I'm going to take this chance just to plug a project that OICR is helping lead, which is a mutation calling challenge run through the DREAM competitions. Paul Boutros' lab has developed a set of simulated genomes where they've spiked in structural variation events like deletions, insertions and inversions, and they release this data to the community, to anybody who's interested in developing software to call structural variation. People then run their software on it and submit variant calls to Paul's group, and the group scores each submission to see how well the different software is performing. There's a lot of variability among structural variation calling methods. If you just Google "ICGC-TCGA DREAM mutation calling challenge", you can go to the website and see a sort of leaderboard of which tools are performing best. This changes quite rapidly, so when you're looking to run different callers on your own data, this is a great resource for figuring out what the state of the art is in the field. I also recommend that you don't just pick one tool and then trust its output. I suggest you run multiple different tools and cross-check them against each other, as a way of both increasing the power to find true events and removing ones that are possibly spurious and that you might not trust.
And within the practical session you'll be looking at different structural variations in IGV as a way of understanding how these patterns of pairs look when you look at their alignments. Okay, with that we'll start the next exercise.
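For the suggestion above about running several callers and cross-checking them, one common way to compare two call sets is reciprocal overlap. The Python sketch below is only an illustration. The function names, the `(chrom, start, end)` tuple layout and the 50% threshold are assumptions, and intervals are assumed to have `end > start`.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two fractions of a and b covered by their intersection.

    a and b are (chrom, start, end) tuples with end > start.
    """
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return min(shared / (a[2] - a[1]), shared / (b[2] - b[1]))


def supported_by_both(calls_a, calls_b, min_frac=0.5):
    """Calls from caller A that some call from caller B reciprocally overlaps."""
    return [
        a for a in calls_a
        if any(reciprocal_overlap(a, b) >= min_frac for b in calls_b)
    ]


# Hypothetical output of two callers on the same sample.
caller_a = [("chr1", 100, 200), ("chr3", 5000, 9000)]
caller_b = [("chr1", 150, 250), ("chr3", 5000, 50000)]
print(supported_by_both(caller_a, caller_b))
```

Requiring the overlap to be large relative to both intervals stops one very long call from "confirming" many short ones.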