So my name is Jared Simpson, I'm a fellow at the OICR. I've only been in Toronto for a few months; I came here from the Sanger Institute in Cambridge, where I was doing a PhD. I'm a computational biologist. I work mainly on developing sequence analysis software and sequence analysis algorithms, which is all going to be relevant to what we're going to talk about today. The main research question that I work on is de novo assembly, which is where you take a set of sequence reads and then you try to reconstruct the genome without any other information. I'm not going to talk about assembly today; I'm going to talk more about how to do sequence analysis based on a reference genome. But if you have any questions about de novo assembly or anything else that's not covered here today, feel free to ask me at the break, and my email is at the end of the slides here. So the focus of this module is going to be on mapping sequence reads to a reference genome, also known as alignment of sequence reads. And then in the second part of this module, I'll be talking about how to find genome rearrangements based on sequence read alignments. So I'm going to split this up into two parts. I'm going to talk for about 40 minutes on the problem of mapping reads to a reference genome, then we'll do an exercise where you'll get to use a mapper yourself and work with some sequencing data, then we'll have a coffee break, and then I'll talk about genome rearrangements, and we'll have another lab where you can run a rearrangement finder on some real sequence data. So here's what I hope you get out of this module. First and foremost, the first step of almost all sequence analysis is finding out where in the reference genome a sequence read came from, and I want to give you a good understanding of how we find that out by mapping sequence reads to the reference genome.
So understanding this process, and then understanding the file formats that you work with when you work with aligned data, is really the primary goal of this part of the workshop. What I'm going to talk about is the standard file formats that sequence reads come in, which are called the FASTA format and the FASTQ format, and then the standard file formats that mapped data is represented by, which are the SAM and BAM formats. I'll talk about some common terminology that you hear used to describe alignments, things like mapping quality and paired-end reads, and how those are all used in downstream sequence analysis. I'll particularly talk about how paired-end reads can be used to find genome rearrangements, things like translocations, where, say, the arms of two different chromosomes are swapped, or large deletions or inversions, these sorts of large structural changes to chromosomes. And in the labs, you'll be able to run a mapper and a rearrangement caller on your own. And please, as I go through, if anything isn't clear, just ask questions, and let me know if I'm going too fast or too slow. Right, so yesterday I think John McPherson talked in depth about the different sequencing platforms that are available. Just to reiterate, there's a wide range of instruments available. Some will give you very long reads; the PacBio instrument, for instance, will sequence 5 kb reads at a time, with a higher error rate, and the throughput on PacBio is quite a bit lower. And then as we go up the scale of throughput, you get to the Illumina HiSeq machine, which can output hundreds of gigabases per run, but the reads are only about 100 bases long. It's now common to run multiple different sequencing technologies on one project, and integrating all of that data is a big problem. The way that we do this now is we map all the data to a reference genome and then work with the aligned data to call, say, mutations in a cancer, or structural changes, or anything else.
So this step of aligning data to the reference genome is really the fundamental step that starts off sequence analysis, and that's what I'm going to be focusing on. Now, I'm not going to talk so much about PacBio or Ion Torrent; I'm going to be focusing mostly on Illumina data in this module. That's the predominant sequencing technology, and it's most likely what most of you will be working with. Now, the very first step in the sequencing process is translating what the instrument measures, which for the Illumina sequencer is taking pictures of fluorescently labeled DNA molecules, and translating those pictures into the actual sequence reads. This process is called base calling. So here's what raw Illumina data looks like. You have an image of a microscope slide where each one of these dots is an individual cluster of DNA molecules. Each cluster is fluorescently labeled, with the color of the cluster giving one of the nucleotides within the read. So all of these dots are an individual read for a particular cycle within the sequencing process. Now, the important thing to understand at this point is that this process of going from these images to base calls isn't perfect. The instrument can make errors; for instance, if these clusters are too close together, it might mix up what color each of the reads is at a particular position and make an incorrect base call. So the sequencing process is inherently noisy, and we can have sequencing errors within our bases. This informs all the downstream analysis, which needs to account for the fact that these sequence reads aren't perfect; we always have sequencing errors within our data. Now, as a concrete example, we can look at one source of error in Illumina data. As I said, Illumina is imaging a cluster of molecules, so it's not a single-molecule process. What it's doing is looking at a group of a few thousand identical DNA molecules at once.
It's elongating each strand of these DNA molecules one step at a time, and usually all of the DNA molecules in that cluster are at the same position at once. So in this example we've had six sequencing cycles, so we've read off these six bases and we're going to the next one. You'd expect most of the DNA molecules within that cluster to be on the same cycle, but occasionally, because a base isn't incorporated, some of these molecules will lag behind. So this one's only on the fourth cycle, and this one's on the eighth; it's gone ahead. These are called phasing and pre-phasing problems, where some of the molecules in the cluster lag behind or jump ahead. This makes the signal noisy when the instrument is trying to read off the color for a particular base, because these ones are all gray but this one's red and this one's blue. So that adds some noise to the process, and this is one of the sources of errors in Illumina data. So because the data is noisy, we want to be able to quantify how trustworthy it is, and we do that with what are called quality scores. A quality score is given for each base in a read, so if you have hundred-base reads you have a hundred different quality scores. This is the sequencing instrument's estimation of how trustworthy the base is: the probability that the base call at that position is incorrect. These are called Phred quality scores. It's a scale from 0 to 40, 40 being the most trustworthy bases. A base quality of 40 means that there is about a 1 in 10,000 chance that that base is incorrect; a Phred score of 10 is about a 10% chance that the base call is incorrect. So as you can imagine, if you're trying to find mutations in cancer and you see what could be a mutation but it only has a quality score of 10, you might not trust that as much as if you saw a mutation in a read and it had a quality score of 40.
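To make the Phred scale concrete: the score is just a log transform of the error probability, Q = -10 log10(p), and the same scale comes up again later for mapping qualities. This is my own minimal Python illustration, not something from the slides:

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred quality q is incorrect."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> int:
    """Round-trip: error probability back to a (rounded) Phred score."""
    return round(-10 * math.log10(p))

# Q40 -> about 1 in 10,000 chance of error; Q10 -> about 1 in 10.
print(phred_to_error_prob(40))   # 0.0001
print(phred_to_error_prob(10))   # 0.1
print(error_prob_to_phred(0.1))  # 10
```

So a Q40 base is three orders of magnitude more trustworthy than a Q10 base, which is why downstream callers weight them so differently.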
And all the downstream sequence analysis software that does things like finding mutations takes these quality scores into account when it's trying to call whether there's been a change at a particular base relative to the reference. Now, one thing I will discuss is that different sequencing technologies all have different error profiles. They all have different chemistry, they're all different sequencing strategies, and those differences in chemistry give you different types of errors and different error rates within the sequencing data. The Illumina sequencer has the lowest error rate right now: about 1 in 200 bases will be incorrect, and it's mainly substitution errors, so if the true base is a G the instrument might say it's a C, and that would be a substitution error. For 454 and Ion Torrent data, it's mainly incorrectly inserted or deleted bases, especially in homopolymer runs. The way the Ion Torrent data works out is that if there are 10 A's in the sequenced DNA, it might say that there are 9 A's there, or it might say that there are 11 A's there; it has problems sequencing through these homopolymer tracts. And then the other major sequencing technology, Pacific Biosciences, has a much higher error rate, typically 10% to 15%, which makes the data somewhat tricky to work with, but it's not really a biased error model; it's more of a mixture of insertions, deletions, and base substitutions. For Illumina data, another important thing to understand is that the error rate is not uniform across the length of the read; the likelihood that a base is identified incorrectly depends on where in the read it is. So this is a profile of the errors within Illumina reads across the read length.
These were 150 base pair reads, so we're going on a scale from 0 to 150 on the X axis, and this is a histogram of the number of times the bases were incorrect at each position of the read. As you can see, the likelihood that the base is called incorrectly goes up at the end of the read, and this is due to the phasing problem that I showed before, where it's more likely that the strands have gotten out of sync as you go along the length of the DNA sequence. Here it's showing two different reads because this is read pair data, which I'll explain in a second; this is the first half of the DNA fragment and this is the second half, and they both have about the same error profile. So there are two main file formats that raw sequence data comes in. There's the FASTA format, which I'm showing here; if you've ever looked at a reference genome, most sequence data comes in this format. It's a pretty simple format: it has lines that identify the sequence, so the lines that start with the greater-than symbol ('>') are the identifier lines, and here's a read name, and then it's followed by the sequence lines afterwards. Now, when you're working with sequence pairs, you'll typically have two of these files, one for one end of the DNA fragment and one for the other end, where reads at corresponding positions in the two files make up one pair. Commonly, when you work with Illumina data, you'll start with FASTQ files. This is really the same as the FASTA file except we're adding the quality information in.
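As a small aside (my own sketch, not from the slides): a FASTQ record is four lines, and in current Illumina data the quality characters are ASCII codes offset by 33, the so-called Phred+33 encoding. A minimal parser might look like this:

```python
def parse_fastq(lines):
    """Yield (name, sequence, qualities) tuples from FASTQ lines, 4 per record."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                       # the '+' separator line
        qual_str = next(it).strip()
        # Phred+33 encoding: subtract 33 from each character's ASCII code
        quals = [ord(c) - 33 for c in qual_str]
        yield header.strip()[1:], seq, quals

# A toy record: 'I' decodes to Q40, '5' to Q20, '!' to Q0
record = ["@read1", "ACGT", "+", "II5!"]
name, seq, quals = next(parse_fastq(record))
print(name, seq, quals)  # read1 ACGT [40, 40, 20, 0]
```

In practice you would use an established library rather than rolling your own, but the decoding really is this simple.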
So in the FASTQ format we again have an identifier line, this time starting with an '@' symbol, then we have a sequence line, then there's another separator line starting with a plus, and then the base qualities. So this is the 0 to 40 Phred-scaled estimate that the particular bases are incorrect; here it's encoded using symbols from the ASCII character set, but these can be transformed back into those 0 to 40 scores. Okay, so that's what the raw data looks like. Now I'm going to talk about aligning these reads to a reference genome. The alignment process is really just trying to find where in the reference genome, say the reference human genome, which is three billion base pairs long, each one of these reads came from. The difference between any two copies of a human chromosome is about one in a thousand bases; the SNP rate is about one in a thousand. So you'd expect most reads, if we just sequenced anyone in this room's genome, to match somewhere quite well in the reference genome. The alignment process is just taking the reads from one individual and mapping them onto this reference genome to see where we think each read came from. Now, there are a couple of computational issues that we have to get over. First, the human genome is very large and very repetitive: about 50% of the human genome is annotated as being part of a repeat. Things like retrotransposons, such as the Alu elements, are present around one million times in the reference genome, and they're all quite similar to each other. So we need to be able to very specifically place reads back on the reference genome, taking into account the fact that the genome is very repetitive. This is very important to get the alignments correct, because if we misalign a read to the reference genome, if we put it in the wrong place, we might call variants that aren't actually true mutations. So we have to be very careful when we're mapping these reads back to the reference
genome. Of course, sequencing instruments like the Illumina sequencer produce a huge amount of data; a typical sequencing run gives you about 600 gigabases. To go through this and align it to the reference genome, the software that we use has to be very efficient. We couldn't use something like BLAST, which people have been using to align data to reference genomes for the last, say, 20 years, because it's just too slow. We need software that can process millions or tens of millions of reads per minute to be able to do this. Luckily, this is a huge research question in sequence analysis, and there are hundreds of different mappers, or alignment programs, to choose from. They're all designed for the specific problem of mapping these short, hundred base pair reads to the reference genome, and they're all quite quick. And of course, the genome that we sequenced, from the individual whose DNA was extracted and put in the sequencer, isn't going to exactly match the reference genome. They're going to have SNPs that differ from the reference genome, they're going to have indels and structural variation, and there are also going to be sequencing errors within the reads. So the alignment program has to be tolerant of mismatches to the reference and still do a good job of finding the correct position in the reference, in the face of the fact that the sequence reads are going to have differences from the reference sequence. So with all that in mind, how do we choose an aligner? As I said, we need to be very accurate about where we place the reads in the reference genome; any time we place a read incorrectly, it's likely to lead to false positive variant calls, which we'd like to minimize. It needs to be very sensitive: it needs to tolerate these differences between the reference genome and the sequence reads, because we want to be able to find SNPs and we want to be able to find indels. So the aligner has to do a good job of accounting for the fact that there are true
differences between the sequence reads and the reference genome. And as I said, the program needs to be very efficient, because it needs to map millions or billions of reads to the reference genome. It's getting to the point where the informatics cost, the cost of running the computers to map all of your data to the reference genome, is one of the highest costs when you do a sequencing project; the cost of actually performing the sequencing is so low now that we have to quite seriously think about how much the computational side of experiments is taking. And that's why you want to choose an aligner that is very efficient and can align a lot of reads to the reference genome. For the tutorial, we're going to use the very popular aligner called BWA, which stands for Burrows-Wheeler Aligner; that's the type of data structure it uses. I'm not going to go into the details of how BWA works, but I'm happy to answer any questions about that in the break. This is a very popular mapper; I'd say most sequencing projects use this program, BWA, to align their reads to a reference genome. You might have heard about the 1000 Genomes Project, which is a big population sequencing project that was started by many groups around the world and is led by Richard Durbin at the Sanger Institute. Richard Durbin's group designed this program, BWA, and it's used in the 1000 Genomes Project and a lot of the other sequencing projects, like ICGC. Now, of course, BWA is not the only mapper out there; there are hundreds of mappers that have been published. I'll just leave this slide in as a very good review paper on the different choices of mappers that you have. So if you're interested, you can read this paper; it'll talk about things like the speed of each one of these programs, how sensitive they are to various levels of differences, and how accurate each one of these aligners is. So if you are interested in trying out a different
program, you can read this and hopefully it will inform your choice of alignment software. It also covers a lot of these different issues I'm talking about, like accuracy and sensitivity, in more detail, so it would be a good resource to read if you're interested in the mapping process. Okay, so going back to aligning an individual sequence read to the reference genome, this is just a picture of the process. We have a sequence read, shown here in red, and the reference genome in black at the top, and we need to pick where in this reference genome this read came from. So how do we do that? Well, we just compare this read to all these different places in the reference genome and see which one is the most likely. So say we've done this and we have three candidate alignment positions: one that doesn't have any mismatches to the reference, one that has a single mismatch, and one that has two mismatches to the reference. At this point we don't know if these mismatches are sequencing errors, and we don't know if they are true differences between the read and the reference; they might be SNPs. So just like how we used base quality to quantify the uncertainty of whether an individual base in the sequencing read is correct or not, we have an equivalent measure, which is called alignment quality, or you might have heard it referred to as mapping quality. This is the alignment program's estimation that the place it has chosen in the reference genome is correct. Again, it's just a Phred-scaled estimate of the probability that the chosen mapping on the reference genome is incorrect. So for instance, if the aligner says that the mapping is Q30, one in a thousand reads with a Q30 alignment will be placed incorrectly in the reference. And this is important to know; we need to know whether our alignments themselves are trustworthy or not when we're trying to find mutations between an individual and the reference genome. So if you find a mutation and the aligner
says that the supporting alignments are all Q3, you're probably not going to trust those alignments, and you would probably say that that mutation isn't true. Okay, so now I'm going to talk about paired-end reads; I alluded to this earlier. Right, so that's a good question. Different programs will do different things. BWA will output one alignment: it will try to pick the best one, the most likely source of that read in the reference genome, and then it will estimate the confidence using this mapping quality. Other programs might give you all possible places, or all plausible places, that the read is from. We generally refer to those as multi-mappers; you might hear this term, "this is a multi-mapper". When such a program writes out the alignment format, you'll have five hits, ten hits to the reference genome, and hopefully it will calculate a mapping quality for each one. For some applications, things like, let's say, ChIP-seq, you might want to know all the possible mapping locations; for other things, like variant calling, you might just be interested in the best one. So it's again a choice of how the software was designed, but it's an important thing to understand. Yeah, that's right, it's essentially how many other candidate locations there are. Let's go back to this example in a bit more detail. Here, for this mapping position to be the true one, you don't need to account for any differences between the read and the reference; for this one to be the true mapping location, you need to account for this mismatch. So say the base quality of this mismatch was very low, say it was Q3: it might be that those are equivalent mapping locations, and the aligner can't really decide between them. So when the mapping score is calculated, it takes into account how many mismatches there are at the different candidate alignment positions and what the base qualities at those positions are, and it'll generally run a probabilistic model to calculate the Q score. It's using the Phred scores when it's
deciding what the best place is. Yeah, so in the worst case, if you have two exact alignments, say the read has no sequencing errors and there are two positions in the reference genome where it matches without mismatches, the aligner can't decide between these two, and then it would give the read a mapping quality of zero. It would say it's completely ambiguous: there are two equivalent mapping locations, and I'm not going to try to decide between them because I don't have any information to go on. So it would just assign the reads that map to these repeats Q0. This is an important quality score, because that's when you really don't want to trust the read; almost no variant calling programs will look at Q0 reads and trust them as being placed correctly. Yeah, as you have more candidate mapping locations, the mapping quality is going to decrease, even for the best one. The way the programs usually work is that they can't exhaustively find every possible mapping location, so they have a heuristic where they estimate the number of possible mapping locations and calculate the quality based on that; that's at least how BWA works. The multi-mappers that try to exhaustively place the read everywhere could make a more refined estimate of which location is the true one, but yeah, the general point is that the more candidate alignment sites there are, the lower the mapping quality. That's right, and these sorts of issues: when the Illumina sequencers first came out, the reads were only 36 base pairs long; actually, the very first iterations were 27 base pairs, but in the first production Illumina machines the reads were around 36 base pairs long. At that length scale, the human genome is very, very repetitive, and being able to say whether a read is placed correctly on the reference genome is very important. So when you had 36 base pair reads, you'd have a lot of these Q0 alignments that you just can't trust. So a way around this, or a
way around the uncertainty of placing reads on the reference genome, is to use paired-end reads. Here we take a fragment of DNA; in standard Illumina sequencing you sequence from this end, about a hundred bases, and that's one read. In paired-end sequencing, you sequence one end of the DNA and then you sequence the other end of the DNA as well. So here you'd have hundred base pair reads from either side, and if this DNA fragment is 400 bases, you'd have 200 bases of sequence in the middle that you don't know anything about. So you have a read, then an unknown stretch of bases, and then another read, and this allows you to map to the reference genome with much higher accuracy. So let's go back to this cartoon of mapping. Here we have a sequence read pair, and we want to find where it is on the reference genome. In this alignment location, we have one half of the pair that matches perfectly and one half that matches with one mismatch; in this candidate location, there's a mismatch in the first half and then tons of mismatches in the second half. So the left alignment is much more likely to be the true alignment in this case, and we're able to refine these alignment locations because we have much more information: not only have we doubled the number of bases in our read from 100 to 200, but because there's this stretch of DNA in the middle, we're able to map around repetitive elements much better. The size of these Alu repeats, which are among the most prevalent repeats in the human genome, is about 300 bases, so if you have paired-end reads spanning 400 bases, you're likely to have one half of the pair map in unique sequence, and that allows you to anchor the read uniquely in the reference. So when we're working with paired-end data, instead of having, say, 80% of our reads mapped uniquely to the reference genome, we can have about 95% of the reads mapped uniquely to the reference
genome, and we'll have power to call variants in all of these different locations. The mapper, of course, takes into account both ends of the pair when it calculates mapping quality, so you'll see a lot of paired-end data with mapping qualities of 40 or 50 or 60: very, very likely that the reads are placed correctly on the reference genome. Okay, so now I'm going to give an overview of the SAM and BAM formats. These were developed at, I'd say, the start of the next-generation sequencing era, when we were all starting to work with large amounts of alignment data and there was a big push to standardize the way we represent alignments, so that every program that wanted to work with alignment data didn't implement its own file format; we'd just have one file format and everybody would work off that specification. SAM stands for Sequence Alignment/Map, and it's a tab-delimited text representation of the alignment data. What that means is that you can just read it with a program like Notepad, or Unix tools like less or cat or anything. BAM is the binary version, Binary Alignment/Map, which is encoded to minimize the amount of space it takes on disk; you can't read it with standard text tools, so you need to run a program to output the alignments as SAM. So this is what the SAM format looks like. There are some columns of information, which I'll get to on the next slides, and then you have your sequence read and then its quality scores, again encoded as ASCII characters on this 0 to 40 Phred scale. I'll go through each one of these fields in detail now. The first field in SAM is the read name; that was the identifier line in the FASTQ or FASTA format that uniquely identifies each one of the reads. Then you have a special flag field. This is a bitmask value that encodes things like whether the read maps to the forward strand of the reference, or whether it had to be reverse complemented to map
to the reference, because sequencers can read either strand of the DNA, so you need to be able to account for the fact that the other strand was read when mapping the reads back to the reference; whether it was reverse complemented is encoded in this flag. It also encodes information like whether this read has a pair somewhere else in the file, whether both ends of the pair mapped to the same chromosome, and a few other, more technical things about how the mapping was done. Moving along, the next fields are the reference chromosome that the read was mapped to and the reference coordinate, which is the position in the reference where the start of the read was mapped. Moving along, this is the mapping quality. Here the mapper found a very confident alignment for this read: it was mapped at Q60, which is something like a one in a million chance that it was mapped incorrectly, so it's a very trustworthy alignment. Next is what's called the CIGAR string. This encoding tells you how the bases of the read line up against the reference genome. I don't know if everybody can see this down here, but this is just a toy example: if we have a reference sequence here and our read sequence here, then to align the read to the reference we had to delete this T. The CIGAR string in this case would be 4M1D6M, which encodes that there are four aligned bases ('4M') between the reference and the read, these four, then one deletion ('1D'), and then six more aligned bases ('6M'). So using just this compact representation, it tells you how the bases within the read line up with the bases on the reference genome. And there are different CIGAR operations for the different ways you can align: in this case there was a T inserted into the read, so here we have 4M, four matches, then 1I, one base inserted, and then 4M, four more matches. Sorry? Ah, yes, good, I put this in on purpose and then I talked about it. So the M's don't tell you whether the
same base is in the reference and the read; they don't give you mismatch information. It only means that the bases are aligned at the same position, not that they're the same nucleotide. That's important to know, because it's a common misconception: if you, say, wanted to calculate how many errors are in your reads, you can't just look at the CIGAR string; there are other tags (like the NM edit-distance tag) that give you that mismatch information. Okay, and then after the CIGAR string, if you have paired-end data, it tells you where on the reference genome the read's pair was aligned. In this case the mate is on the same chromosome, so it's chr19 and chr19, and then it tells you the position of the start of the mate on the reference. Then the insert size is the length of the DNA fragment from here to here: you usually have a hundred-base read, some distance in between where we don't know the sequence, and then another hundred-base read, and the insert size tells you the total distance across the two. That's important for structural variation calling, because later on we're going to look for deviations between the expected insert size and what the mapper found, as a way of finding deletions and insertions. Yes, it is in the header. I didn't talk about the SAM headers, but the tutorial covers them and you can look at them; essentially, before all the alignments, the lines that start with '@' symbols are the header of the SAM file, and within there there's a series of tags, and within those tags most mappers will tell you the version of the reference that the reads were aligned to. I'll show this on the next slide: this link here, samtools.sourceforge.net, is the specification for this file format, and within that spec it will tell you what each one of the tags found in the header means. So if you ever need to look up information about the SAM
format, you want to look at this file, and then you can see, okay, this tag is the version of the reference, things like that. The most common way of working with these SAM files is to use the package called SAMtools. This was written by Heng Li, and it can do a lot of different things: mainly, you can convert between the text SAM format and the binary BAM format, but you can also sort the alignments by position on the reference genome, or extract alignments for just a small subregion of the reference that you're interested in. I just told you about the specification; if you have any questions or need help with these, there are a lot of good resources. There's a mailing list for SAMtools where you can just post a question about any of these things; if you, say, want to know how to find out the version of the reference, you can post that question there and someone will quickly get back to you. And then there are two more general web forums that people post questions to: one called Biostars, which is a Stack Overflow-type question and answer site, and a web forum called SEQanswers, where there are a lot of helpful users and people post questions and get help on anything related to sequencing. Okay, so now we'll go through the first exercise, on read mapping. I guess first you'll want to log into your cloud instances, and we can help with that if anybody needs it. Okay, so the first part was just working with alignment data, getting reads mapped to a reference genome, and the natural next step is trying to understand the ways in which the thing we've sequenced differs from the reference genome. There are different types of variation, from small single nucleotide polymorphisms, which are just single base changes, to short indels, where some amount of sequence from one base pair to a few hundred base pairs has been inserted or deleted, and then there are much larger changes, which we generally call structural changes or genome rearrangements.
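Before moving on to rearrangements, it's worth noting that the SAM fields from the first part, the flag bitmask and the CIGAR string, are easy to decode by hand. Here's a rough Python sketch; the bit values come from the SAM specification, but the helper function names are my own:

```python
import re

# A few of the standard SAM flag bits, as defined in the SAM specification
FLAG_PAIRED      = 0x1   # read is part of a pair
FLAG_PROPER_PAIR = 0x2   # both ends aligned in the expected orientation/distance
FLAG_UNMAPPED    = 0x4   # the read itself is unmapped
FLAG_REVERSE     = 0x10  # read was reverse complemented to map

def describe_flag(flag: int) -> list:
    """Name the subset of flag bits (defined above) that are set."""
    bits = [(FLAG_PAIRED, "paired"), (FLAG_PROPER_PAIR, "proper_pair"),
            (FLAG_UNMAPPED, "unmapped"), (FLAG_REVERSE, "reverse_strand")]
    return [name for bit, name in bits if flag & bit]

def parse_cigar(cigar: str):
    """Split a CIGAR string into (length, operation) tuples."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def reference_span(cigar: str) -> int:
    """Reference bases consumed by the alignment: M, D, N, = and X advance the reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")

print(describe_flag(99))         # flag 99 has the paired and proper-pair bits set
print(parse_cigar("4M1D6M"))     # [(4, 'M'), (1, 'D'), (6, 'M')]
print(reference_span("4M1D6M"))  # 11: the 1D deletion still consumes the reference
```

The deletion example from the slides, 4M1D6M, spans 11 reference bases but only 10 read bases, which is exactly the asymmetry the CIGAR operations are there to capture.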
The different classes of these can be large insertions, say a kilobase of sequence inserted into the genome, or deletions, whole genome segments inverted, translocations, where bits of sequence are swapped between different chromosomes, and there can be copy number variation as well, where entire blocks have been duplicated, either within a local region or as extra copies elsewhere in the genome. I'm going to talk about different ways to find these larger structural changes using paired-end data; I think tomorrow Sohrab is going to talk about finding shorter mutations, SNPs, substitutions in cancer, short indels, and also copy number variation. We all start from the same place, which is the BAM files that we just generated in the first part of this module, and all of that is the input into these downstream tools, which will catalog the types of variation present.

Right, so the standard way of finding structural variation, these larger changes, is using the paired-end data that I introduced in the first part of the lab. There are two different ways of generating sequencing read pairs. There's fragmentation-based, which is the standard way of doing it on Illumina: you take some genomic DNA, shear it into pieces that are about 200 to 500 bases in length, and then you sequence the ends of these fragments. Here the yellow parts are the sequencing adapters, and we sequence inwards from these adapters, getting the paired-end reads that we worked with in the first part of the lab. This is a fairly straightforward way, and it's the standard way of doing Illumina paired-end sequencing, but you're limited by the size of fragment you can do with this protocol; you can only really do 500, maybe 600, base pair fragments in paired-end sequencing. This doesn't give you really long-range information. If you want longer-range information, tens of kilobases, you can use mate-pair sequencing, and how this works is that, again, we take
genomic DNA and shear it, but we select for a much larger fragment size of multiple kilobases, say 1 to 20 kb pieces, and we make a circle from these bits of DNA, and then we cut the circle and sequence from the pieces that were cut out of the circle. So instead of sequencing this 5 kb piece of DNA directly, we make a circle out of it, and then we can sequence this part on the Illumina sequencer just like before. Here, when we map the reads that have come from this part of the circle of DNA, they'll map very far apart on the reference genome, and this is useful for finding very large structural variation. When you work with genome assemblies, it's critical to have this really long-range information to get over repetitive elements of the reference genome, so this is a really common thing to do for genome assembly or for structural variation finding, but most data now is from the shorter-range 500 base pair paired-end protocol.

Now, what I didn't talk about much in the first part is the orientation of these read pairs. DNA is a double-stranded molecule, and sequencing always proceeds from the 5 prime end of the DNA to the 3 prime end. When we sequence a read pair, say this is the 5 prime end, we sequence the upper strand of DNA in this direction, and then, again 5 prime to 3 prime, we sequence the lower strand. What this means is that when we map these reads to the reference genome, we have to take into account the expected orientation of the read pairs, and the expected orientation is that one end of the pair is going to map to the forward strand of the reference and the other end of the pair is going to map to the reverse strand of the reference. They're from different strands, so we expect one of each, and it's critical for structural variation finding to understand that there's an expected orientation of these read pairs. So is that clear to everybody, that we're expecting one read to
be reverse complemented and one to be just normal?

Now, the second thing that we need to understand for structural variation is that when we sequence read pairs, we don't always have exactly 400 base pair or exactly 500 base pair fragments; they follow a distribution of insert sizes. The usual way of doing size selection when making these libraries for DNA sequencing is that you run the DNA on a gel and then cut out a band of the gel, and everything within a range gets sequenced. So you don't have exactly 400 base pair fragments, but you might have some that are 400 base pairs, some that are 401, some that are 450 base pair fragments. We can calculate the expected distribution of fragment sizes, so instead of having exact sizes, the sizes are sampled from a distribution, and we can use this information about what the expected size of the DNA fragments is to find things like insertions and deletions.

Now, when BWA maps paired-end data to a reference genome, it classifies pairs as being concordant or discordant. A concordant pair is a pair that matches the reference genome with the expected orientation, so one read on the forward strand and one read on the reverse strand, and is also within the expected distance. Going to this example, if, say, this is 380 bases and this is 420 bases, any pair that maps within that range is within the expected distance of the paired-end library, and anything that maps outside of that range is outside of the expected distance. The pairs that are mapped either without the expected orientation or without the expected distance are called discordant, and it's these discordant read pairs that give evidence of structural variation. So we'll be looking at these read pairs and seeing what they tell us about the genomes.

Here I'm just going to show some slides that show different situations of structural variation. The top line is what we'll call the donor genome, this is what you've
sequenced, and the bottom one is the reference genome. In this case what I'm showing is a deletion: the reference genome has this red block that isn't present in the donor genome; that red block was deleted in this individual. I'm just showing it with dashed lines to indicate that the sequence isn't present, with these red bars being the breakpoints of the deletion. Now, when we sequence the donor with a read pair, we get a pair that looks like this; it might be 400 base pairs long. When we map that pair to the reference, because we need to accommodate this red block that's not in the donor, the reads map much further away from each other than you'd expect: the aligner, say BWA, had to put these reads 600 bases apart instead of the expected 400 bases. In this case this would be a discordant read pair, and we could use it to try to find this deletion. What you do is you take your BAM file and you run a program that looks for clusters of discordant pairs that are consistently mapped further away from each other than you expect. That's the signature for a deletion, where the insert size field in the BAM file is much larger than expected.

Now, the opposite of that is the signature of an insertion. In this case the red block is only present in the donor but not the reference, so here when we sequence it, again if this pair is 400 bases, for BWA to align this pair to the reference genome it had to put the reads much closer together than expected; here they might only be 200 bases apart. In this case what we can do is go through our BAM file and look for read pairs that are consistently mapped closer together than expected; that's the signature for an insertion.

Now, of course, we can have much more complex changes. Here's what it looks like if a piece of the donor genome was duplicated in tandem. We have this blue arrow which has two copies in the donor, and if we have a pair spanning
these two copies, where the reference only has one copy, this read is going to map here and this read gets mapped here, in the wrong orientation. In this case we have the signature where the insert size might not look abnormal, but the reads are mapped in the wrong orientation: this one might be mapped to the forward strand and this one to the reverse strand, but they're pointing away from each other instead of pointing towards each other like you'd expect. This is the signature of a tandem duplication.

Likewise, if a section of the genome was inverted, so here again we have a blue segment of the donor genome that's been flipped head to tail relative to the reference genome, we have a pair mapping from here to the tail of this blue segment in the donor; this read maps here, but this one maps all the way over here, again in the wrong orientation. This would be the signature of an inversion.

Now, when you're working with paired-end data, there's a shorthand for discussing the orientation of reads: reads that are mapped in the expected orientation are called FR, for forward-reverse; reads that are mapped like this would be RF, reverse-forward; and reads that are mapped like this would be FF, forward-forward. This is just a way of referring to how the different read pairs are oriented.

So if we summarize what the different signatures of structural variation are, it looks like this. For an insertion, the distance between the ends is smaller than expected, but the orientation is normal. For a deletion, the distance between the ends is much larger than expected; again, the orientation is normal. For an inversion, we can't say much about the mapped distance: the ends could be very far away, if it was a megabase inversion, or they could be close together, if it was something more local, but the orientation is this FF signature, where both reads are mapped to the forward strand. For a tandem duplication, again we can't say what the mapped distance
will be; usually it will be different than expected, but we have this signature where the first read is reversed and the next one is forward. I didn't show any signatures of interchromosomal rearrangements, but usually that signature is quite easy, where one half of the pair maps to one chromosome and the other half of the pair maps to another. By F I mean the forward reference strand, so an F read matches the strand of the reference as it's given in the FASTA file, and an R read was reverse complemented to match the reference, so it matches the reverse reference strand.

Okay, so we don't have perfect power to find all structural variations. If an insertion is very long, we might not have long enough paired ends to span from one end of the insertion to the other. Say this is a 2 kb insertion and we only sequenced 500 base pair fragments of DNA: we're not going to have pairs spanning from here to here, we'll just have pairs that fall in the middle of the insertion, and in that case we're not going to be able to find it. So we're somewhat limited in the size of insertion that we can find with 500 base pair DNA fragments; this is why you can do mate-pair sequencing, to get longer-range information. For deletions, you can find deletions of any size, because even if a megabase is deleted, you're still sequencing on either side of the breakpoint with a 500 base pair fragment.

This type of structural variation analysis is usually referred to as discordant pair analysis. There's a different line of evidence for finding structural variation, which is called split reads. These are reads where we couldn't align the full length of the read to the reference genome. Usually, if we have 100 base pair reads, we expect to map the full 100 base pair segment to some part of the reference genome, but if a read falls on a breakpoint of one of these structural variations, we might not be able to map the whole length of the read; we might have to either just partially map it or map it in multiple places, and these are
called split reads. Here's an example of a split read signature for a deletion: the first read covers this breakpoint, the red line, so the tail of the read maps to this side of the breakpoint here, while the head of the read maps to this point over here. If your aligner is sensitive enough to long deletions, it might map the read like this, with a huge gap in the alignment. If you looked at the CIGAR string for this sort of deletion, it would be something like 50M, 50 matches, then a 200 base pair deletion, then another 50 matches. This is another line of evidence that you can use to look for structural variation. Sometimes what you might do is use a discordant pair program to find structural variation, and then look for split reads, or partially aligned reads, around the breakpoint to find the exact location of the event. Usually, when you're making calls just based on discordant pairs, you don't have a good sense of where the exact break in the DNA was; you only know it's within a certain range. But if you look at split reads, you can get down to the individual bases that formed the breakpoint.

Okay, just a few brief words on particular approaches for cancer sequencing. In a typical cancer study, you'll sequence the individual's normal genome and you'll sequence DNA derived from the tumor, and then you have a couple of different strategies for doing the analysis. You can call structural variation in the normal and structural variation in the tumor independently, and then, if you're just interested in the rearrangements that are in the tumor, you can filter out all of the structural variants that you found in the normal, to give you a set of tumor-specific calls. Because you don't have perfect power to find structural variation, this tends to give you a lot of false positives in the tumor, where you just didn't make the same call in the normal. So the second approach is that you find somatic
structural variants, the ones that are in the tumor, and then for each of these somatic structural variants you look in the normal reads to see if there's any evidence for the same event. Say you found a 1 kb deletion in your cancer sample: you can go to the same region in the BAM file of the normal and look for any discordant pairs that show a 1 kb deletion there, and that way you can filter out anything that has evidence in the germline. This can lower your false positive rate for structural variation calling.

A special case of structural variants that is important for cancer is gene fusions. This is where you have two different chromosomes, and a translocation between these chromosomes brings two genes together and creates a fusion protein. You can find these using the sort of tools that I just discussed, with paired-end data and discordant pairs, but if you do RNA sequencing, which is going to be covered a bit later in the course and which John might have talked about yesterday, you can look for RNA reads that cover the fusion breakpoint, so some reads that map to gene A and some reads that map to gene B. There are programs, one written I think at the GSC called deFuse, that will take RNA-seq data and look for fusion transcripts directly from the RNA. So that's another way of approaching structural variation detection if you're particularly interested in looking at fused genes.

Okay, so that was the talk part about rearrangements. Are there any questions on those different signatures before I introduce the lab? With DNA data you could look for the same translocations and just see if the translocation breakpoints are, say, within exons, and that would be an alternative way of finding fused genes from DNA, but it's probably more common to find gene fusions from RNA.

Okay, so I'll introduce the second part of the lab. Here we're going to take the BAM file that we
produced with BWA in the first part, and we're going to use a program called HydraSV, which was written by Aaron Quinlan, to find the rearrangements. There are multiple rearrangements in the sample that I gave you reads for, and again you'll just walk through the different steps of running this program to find them. I put the link to the paper up there; it was published in Genome Research a few years ago. I'm using Hydra for this example because it's well designed and quite easy to run, but there are many, many structural variation callers: you might have heard of BreakDancer, or GASV, or Pindel. They're all very commonly used programs, and often people will run multiple different structural variation callers and then either merge the results or pick the ones that look like they have the best sensitivity. So it's worth spending time learning multiple different packages and learning their strengths and weaknesses. I just put down here that in bioinformatics it's like a golden rule that there's rarely one program that does absolutely everything perfectly, so it's worth taking the time to learn the strengths and weaknesses of different programs.
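As a recap of the discordant-pair signatures before you start the lab, here is a toy sketch of turning one pair's orientation and mapped distance into a candidate event type. This is not taken from Hydra or any real caller, and the insert size range is illustrative; real tools cluster many supporting pairs and add split-read evidence before making a call.

```python
# Toy classifier for a single read pair, based on the signature table
# from the talk. Real callers like Hydra aggregate many pairs; this is
# only the core lookup for one pair.
MIN_INSERT, MAX_INSERT = 380, 420   # illustrative expected insert size range

def pair_signature(orientation, distance):
    """orientation: 'FR', 'RF', or 'FF'; distance: mapped end-to-end bp."""
    if orientation == "FR":                # expected orientation
        if distance > MAX_INSERT:
            return "deletion"              # ends pushed apart by missing donor sequence
        if distance < MIN_INSERT:
            return "insertion"             # ends pulled together by extra donor sequence
        return "concordant"                # nothing unusual about this pair
    if orientation == "FF":
        return "inversion"                 # one end falls inside the flipped segment
    if orientation == "RF":
        return "tandem duplication"        # reads point away from each other
    return "unknown"

print(pair_signature("FR", 650))  # deletion
print(pair_signature("FR", 200))  # insertion
print(pair_signature("FF", 400))  # inversion
print(pair_signature("RF", 400))  # tandem duplication
```

Interchromosomal events are omitted here because they are detected from the pair's two reference names rather than from distance and orientation alone.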