Yeah. Okay. Thanks, Michelle. So again, all these slides are free for your use as long as you reference where you got them. Oh, I'll just speak into the microphone. Okay, great. So in this module we're going to discuss data visualization, which plays a pretty important role in genomic research. It makes it possible to observe outliers, correlations and trends in large data sets, which is critical first to understanding your data and second to fine-tuning the parameters and steps of your analytical pipeline. Increasingly, visualization is performed numerous times throughout your analytical pipeline, rather than just at the end on the final result of your workflow when you want to generate a figure. So hopefully by the end of the module and the lab you'll have a better understanding of the kinds of tools available for visualization, and specifically you'll have lots of hands-on experience with inspecting variants in a genome browser. The talk is organized in two parts. In the first part we're going to discuss a couple of visualization tools, specifically genome browsers, and we'll have an overview of IGV. In the second part we're going to talk in a bit more detail about single nucleotide polymorphisms and structural variants. There are lots of other things you can do with IGV. You can look at splicing information if you have RNA-seq data. You can look at lots of different kinds of data that we don't have time to go over today, so I encourage you to look up the IGV tutorial from the Broad, which has 200 slides on basically everything you can do with IGV. So first I want to spend a couple of slides convincing you that visualization is really useful to your analysis and that, in contrast to the other computing tasks we're going to talk about in this workshop, it's something that in many cases the human brain can actually do better than a computer. So let's talk about one compelling example of this.
This is a famous data set called Anscombe's quartet, which comprises four sets of two variables each. So we see the four data sets here. Each has an x and a y. We can stare at this matrix all day and it's not going to tell us much. Typically, summary statistics like the average or variance of each x and y are one way to do a high-level comparison, but in this particular case the x and y summaries are essentially identical in all four data sets. So from the summaries you might conclude that the data sets are identical, until we actually visualize the distributions. When we do that, we see the real relationships start to emerge. Data set one is a set of points that are roughly linear with some variance. The second data set fits a curve. The third one is a very tight linear relationship with one outlier, and in data set four, with one exception, x is constant in every case. So you can immediately see with your eyes that these are different data sets, although you're not going to get that from summary statistics. So a visual check is very quick and effective, which is one strength, and the other strength is the ability to identify outliers. This is very important in designing your workflow because outliers are typically what we look for. SNVs, rearrangements: all these things are outliers, which are going to be rare in the genome. Identification of outliers is something that we can do amazingly fast visually, without much conscious effort, and this is a process called pre-attentive processing. I don't know if anyone in the audience has heard about pre-attentive processing before. Well, it basically means the stuff you notice right away. Our brains have the capability of capturing things like color, shape, and size and processing them without any conscious effort. So we can tell immediately when part of an image is different from the rest without really having to even focus on it.
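As a rough sketch of the quartet example, the summary statistics can be computed directly. The values below are the commonly published quartet data, not numbers copied from the slides, and the choice of mean and sample variance as the summaries is an assumption about what the slide showed. In the published data the summaries agree to roughly two decimal places rather than being exactly equal.

```python
from statistics import mean, variance

# Commonly published quartet values (assumed; not taken from the slides).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return (mean x, variance x, mean y, variance y), rounded to 2 decimals."""
    return tuple(round(v, 2) for v in (mean(xs), variance(xs), mean(ys), variance(ys)))


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

# Print one row of summaries per data set; the rows come out nearly the same.
for name, stats in SUMMARIES.items():
    print(name, stats)
```

The four printed rows are nearly indistinguishable, which is why the differences only show up once the points are plotted.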
So in an example like this one on the top right, where there are only one or two differences and everything else is the same, it's very easy to spot the outlier. Have you guys seen a Where's Waldo book before? Can I see a show of hands? So some people haven't. So this is a very nice example of finding Waldo. Here's a more typical example. Up here on the left you can see where Waldo normally hangs out, which is a scene that has a great variety of shapes and colors and no obvious trend, and that's exactly the scenario that defeats our pre-attentive processing. Researchers at MIT have actually used eye-tracking devices to observe how people go about finding Waldo in a case like this. Here's a typical search. This jagged line is the path of the eye as it scans across the page, checks every face, goes on, and eventually ends in success where this pink dot is. So this is a sequential process. It's called attentive processing. It lets us focus on one thing at a time, and it takes a really long time. In general we want our data visualization to be effortless, and therefore we want our visualization tools to emphasize those attributes that are critical to successful pre-attentive processing. These are things like color, shape, and size, and visualization tools developed for genomics have really emphasized these features in the user interface, as we shall see later in the slides and also in the lab. So here are some examples of cases where it's very easy to spot outliers. To summarize, we want to visualize so that we can identify patterns and outliers, and we want to do so in a fast and efficient way. So what tools can we use to visualize data? Well, it really depends on what your data is and how much of it you have. Let's suppose you're looking at interchromosomal structural rearrangements in a large cohort of patients. You might want to generate a Circos plot, for instance.
Some of you may have seen these circular plots, where you can draw lines that connect different parts of the genome that are joined together. If you're looking at mutations, you can use a genome browser like IGV or Savant. Privacy is another important consideration, especially when working with patient-derived data. In that case you want to make sure, for instance, that you don't have to upload your data to some place on the web that's not secured. That being said, genome browsers are typically a very good option for light to thorough investigation in most cases, including those where privacy is a consideration. Yesterday I actually counted 49 different genome browsers built by different teams for different purposes. Most of them solve the core challenge of being able to interactively look at various types of high-throughput data in the context of a reference genome. That genome can be human, mouse, fly, pathogen. I know some of you guys are doing pathogen analysis, and it turns out there are specific browsers for facilitating that type of analysis. I haven't personally worked with them before, but apparently it's a niche that definitely has tools in it. So if the task at hand is visualizing alignments to the genome or inspecting variants, all the things we're going to look at today, genome browsers are a great tool. They can handle large files stored either locally or remotely, and they can keep that data private. IGV is actually the most highly used and fully featured browser, and it's the one we're going to spend today talking about. Another popular one is the UCSC Genome Browser, and an offshoot of that is the UCSC Cancer Genomics Browser, which lets you browse various cancer-specific data sets. Then there's Trackster, which is built on top of the genomics workbench Galaxy, which is going to be in module 7, and the Savant browser, which has plug-in style analytical tools.
These kinds of browsers can perform analytics coupled with visual analysis. So there are lots of choices up there, and depending on your analysis needs I encourage you to take a look at a few of them. For the rest of the talk we'll focus in more detail on IGV, the Integrative Genomics Viewer. Its design is focused on integrative genomic studies, so it has support for both array-based and next-generation sequencing data, and it can integrate this together with clinical or phenotypic data. So a user can visualize and explore their own data set, or explore any publicly available data set, or integrate public and private data sets together, which is extremely useful. I just want to mention before we go on that the following slides, which have the Broad logo on the top right here, are in part adapted from the Broad's IGV tutorial, which is an excellent resource. So, just quickly noting some of the features of IGV. The user interface is pretty intuitive and easy to learn, and we're going to learn it today. Like I said, it can integrate multiple data types with clinical information, so someone might have sequencing data on top of copy number data on top of expression data from the same patients, and you can load all of those in IGV along with clinical information and do all sorts of sorting and visualization on that full set. It can pull data from local or remote locations, and it also allows automation of certain tasks using what's called a batch job. These can be invoked while running an IGV session, so later we'll have an IGV session in progress and go through a number of examples, and then at the end of the lab we're going to invoke a batch job that will do everything we did manually in one go.
It can also be used non-interactively. For instance, you can invoke IGV on the command line and pass it a batch file and say: IGV, open in the background, do everything I've specified in the batch file, for instance take a snapshot of the read alignments at every SNV of interest, produce a file in a specific directory, and then close. It will do that, and then all you have to do is go through the files one by one and identify regions that look messy or that you'd want to look into in more detail. So depending on what your task is, there are different ways to use it. As we'll see in the lab, in the same IGV session we can view local files, which we're going to do in a secure environment, and at the same time we can view remote files from public data sets without actually downloading them, which is very useful if you want to look at a lot of the big data that's out there and you don't want to host it on your own system. Okay, so in the next part of the talk we'll take a quick look at how IGV is launched and how we select a genome reference and load data. We'll take a quick look at what whole genome sequencing looks like, which Matei gave us a brief glimpse of, and finally we'll go over some more in-depth examples of SNVs and structural variants so that we have a good understanding of these before we get to the practical exercises of the lab. Launching IGV is pretty easy. On the IGV website there's a download button, and if we click on this we're taken to a page where we can simply launch IGV with a certain amount of predefined memory. This is the easiest way to do it. Alternatively, we can install the application locally, so you could just download IGV and run it from your computer, and if you want to invoke IGV from the command line, that's the way you would do it. The graphical user interface that comes up is divided into a number of controls and panels.
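The non-interactive use described above can be sketched as a small script that writes an IGV batch file with one snapshot per SNV of interest. The batch commands it emits (`new`, `genome`, `load`, `snapshotDirectory`, `goto`, `sort base`, `snapshot`, `exit`) are standard IGV batch commands as far as I know, but the function name, the file paths, and the 100 bp padding are hypothetical placeholders, so treat this as a sketch rather than the lab's actual batch file.

```python
def build_igv_batch(bam_path, genome, snapshot_dir, snvs, padding=100):
    """Return IGV batch-script text that snapshots a window around each SNV.

    snvs is a list of (chromosome, 1-based position) tuples.
    padding is a hypothetical window half-width in base pairs.
    """
    lines = [
        "new",                                # start from a fresh session
        f"genome {genome}",                   # the reference is chosen before loading data
        f"load {bam_path}",
        f"snapshotDirectory {snapshot_dir}",
    ]
    for chrom, pos in snvs:
        start = max(1, pos - padding)
        end = pos + padding
        lines.append(f"goto {chrom}:{start}-{end}")
        lines.append("sort base")             # bring mismatching reads to the top
        lines.append(f"snapshot {chrom}_{pos}.png")
    lines.append("exit")                      # close IGV when finished
    return "\n".join(lines) + "\n"


# Hypothetical inputs, for illustration only.
BATCH_TEXT = build_igv_batch(
    bam_path="/path/to/sample.bam",
    genome="hg19",
    snapshot_dir="/path/to/snapshots",
    snvs=[("chr1", 1000), ("chr6", 50)],
)
print(BATCH_TEXT)
```

Assuming a local install, text like this would be saved to a file and handed to IGV at startup through its batch option, or run from inside a session through the batch-script entry in the Tools menu.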
At the top left is a command bar with controls for selecting a reference genome, which has to be chosen first, before loading any data. IGV provides dozens of hosted genome references to choose from, but it also allows you to add your own. So if your genome of interest is not there, you could add your own. You just need a FASTA file. Data files can be loaded in multiple ways. As I mentioned, you can use the built-in file browser to select a file from your local system, or you can enter a URL if the file is hosted somewhere on the internet, or you can select entries from the server menu. By default this menu provides access to data and annotation files that are hosted at the Broad Institute specifically for viewing in IGV. This is a lot of public data, and you can explore this menu when we pull up IGV in the lab. In this particular example we're going to load some ChIP-seq data representing four epigenetic marks profiled in one individual. Again, this is the GM individual that we've talked about before. So this is what happens when we load these four tracks: we see four horizontal lines come up in the data panel of the window. To label all these different parts of the window, we go to the next slide, and we see that this is the data panel. Up here is a toolbar that has a couple of clickable functions, so we can turn certain things on and off, and we'll play with this in the lab. We also have pull-down menus, and in the genome ruler section we see that we are actually zoomed out to see alignments against the whole genome. In this case we see that we're looking at the human genome reference, and we see chromosomes 1 to Y. If we were zoomed in to a particular chromosome we would also see the ideogram, like Matei showed us earlier, with a red square over the region that we're actually browsing in.
The remainder of the window is divided into one or more data panels and an attribute panel, which I forgot to label, but it's right here. It's this matrix of colored bars. Basically, the data from each sample or patient is displayed as one horizontal row, and each one is called a track. If any sample or track attributes have been loaded, for instance age or sex of the patient, or data type perhaps, these will be displayed as a color-coded matrix in the attribute panel. This can be sorted on. For instance, if you're looking at primary versus recurrent disease, you could sort on that to see if your mutations cluster with primary disease or recurrent disease, and so on and so forth. At the bottom we see genome features, which are loaded by default. By default these are RefSeq genes, but you can load any annotations here. You can load a different version of genes, Ensembl genes perhaps, or you can load transcription factor binding sites, or basically any annotation that is mapped to the genome. IGV currently supports more than 30 different file formats, including many of the common formats for genome annotations, as I mentioned, RefSeq and all these gene annotations, as well as sequence alignments, so the BAM files that we generated earlier today, variant calls, microarray data, and copy number data. The full file format list is available at this link, but basically the file format defines the track type that IGV will launch when you load a file, and that track type determines the predefined display options.
So in our case, when we're looking at sequencing data, the files are in BAM format and the track type is therefore an alignment track. If we're zoomed out too far, and in this case you can see the chromosome ideogram here, we can't see individual reads yet, because it would take too much memory to load them all, but we do see a message that tells us we have to get closer. So you know that when you're zoomed out too far to see your data, you just have to zoom in a bit more. What we can see in this view is a bar chart of read coverage. It's these gray bars, and you can see that the coverage is fairly uniform along the chromosome until we get to the centromeres, which are areas with typically very high repetitive content, so you usually get a huge bump at those locations. That's a typical coverage plot of a chromosome. We need to zoom in to about a 30 kilobase window in order to start to see our alignments, although this is a threshold that can be changed depending on what your data is like. If you have very high depth data, maybe you did really high depth exome sequencing, you'd want to modify your settings to take into account whether you have low-pass genomes or high-depth exomes. It will also depend on how much memory you're running IGV with, so that's something to play around with. When we zoom in sufficiently, IGV uses color and transparency to highlight interesting events. These are outliers, so we're going to look for outliers in the data, and IGV is going to visually de-emphasize things that are not outliers. We can see individual read alignments as these horizontal gray bars. The direction of each read is denoted by the pointy end of the bar, so each bar has one flat end and one pointy end that points in the read's direction, and these paired reads are inward-facing.
The gray color in this case indicates that the mapping quality is good, so all these reads are mapping with good quality at this location in the genome. Now, the colored bars that we see here, both in the reads and in the bar chart of coverage, indicate locations where a large number of read bases mismatch the reference, so these are helpful in identifying putative SNPs. The relative size and color of these bars indicate the allelic frequency at each base. In this case we can see that the allelic frequency of the alternate base, here a T, which is color-coded blue, is almost 100%, so this is a homozygous mutation or polymorphism. So this will give you an idea of the allelic frequency, and the intensity of the color gives you an idea of the quality of that base. We can see here in the inset that positions with high base quality will have a very strong color, and positions with low base quality will have a very weak or very transparent color. For instance, in this particular case in the top read on the right, this G is a very light orange. That's a very poor quality call, so we wouldn't really trust calls like that as evidence of an SNV. So there are a number of metrics, like base quality, that are useful in evaluating whether a SNP or a polymorphism is valid, and the same goes for structural variants. For SNVs, in addition to base quality, we're going to consider read coverage, the allelic frequency of the alternate base, strand bias, and the mapping quality of the reads. For structural variants we'll take a look at coverage and, most importantly, insert size and read pair orientation. Examples are the best way to consider these metrics, so let's go over a couple of examples for SNVs. In this case, this is an SNV that changes a C to a T. When we're zoomed in enough, at the bottom here we can see the actual sequence of the reference, so we can see that this should be a C, and in this patient we see a T. We also see that there's decent
read coverage. There's a little number here, I don't know if you guys can read it, it's 50. That's the highest bar on the bar chart, so there are 50 reads covering this position, and you can see that it's a heterozygous SNP, so it's about 60% T and 40% C. The color transparency in most of these cases looks really good, so the base quality is actually quite high. Obviously we'd want to scroll down and see all the reads, but this would look like a decent candidate for a real SNP. In the second example we can see that the reference allele, which is an A, is changed to a C, but only in about 30% of the reads. Here the reads are color coded: the reads on the forward strand are colored pink and the reads on the reverse strand are colored blue. Then the alignments are sorted by base, so the reads with a different base than the reference show up at the top, and what is immediately obvious is that there's a strand bias for this alternate base call. The C only shows up on the reverse-strand reads, and after this mismatch you can see that there are additional mismatches in all of these reads. So this is kind of a super obvious artifact and not a true variant. This is what you'd look for to exclude variants. Okay, we'll do more variants in the lab. I'm going to switch now to talk about structural events, which aren't as straightforward to interpret, so we're going to spend a few slides talking about how they're detected. We'll start off with deletions, translocations and inversions. Oh, and by the way, feel free to interrupt at any point. I'm happy to answer questions if something doesn't make sense. Otherwise I'm just going to keep going, but feel free to ask questions. Yes? Yeah, so the question is, are there differences between SNVs and SNPs? Polymorphisms are changes found in the general population at a certain frequency. We have 3 billion base pairs, so we're going to have about 3 million SNPs. Each person has an average of 3
million SNPs, and these are going to be found at some percentage in different populations. SNVs are usually somatic mutations, so in a tumor, so they're not germline. A SNP, a polymorphism, will be inherited from your parents. A variant, which would be an interesting thing to look for in cancer analysis for instance, is something that is somatically acquired in a tumor, or at least it's very rare in the population, and it can be deleterious in the patient. So for somatic SNVs we would look for cases where there's a mutation in the tumor and not in the germline of that individual, which would indicate that the variant is somehow potentially involved in disease initiation or progression. There are germline variants that can cause predisposition to disease, but those are a bit more rare, and those would be rare in the general population. Typically, I think the threshold when we do SNP analysis is that we exclude anything that's found in more than 1% of the population, and if it's found in less than 1% of the population we might keep it in for further analysis. Yeah? So in IGV, at the beginning, we have to choose the reference genome, and the data was already... So the question is, how do we know what reference genome was used to align our data? Well, if you've done the data alignment yourself, you will know. If you haven't done the data alignment yourself, it may say in the BAM header. The BAM file always has a header, and depending on the parameters that were used to run the alignment, in many cases the tool and the version of the reference genome that was used will be reported. If that's not the case and you're pulling data from a public repository, it should say wherever it is hosted. If it's a GEO data set, it'll say this was an hg18 human data set. You will also, if you have RNA-seq data, quickly figure out if you're in the wrong genome, because your reads
won't match exons. So if you're looking at RNA-seq data and you forgot to change the default from hg18 to hg19, and you load hg19 data on the hg18 reference, you'll see very nice peaks of reads that should correspond to exons but are nowhere near the real exons, or are shifted, and that'll be a really good clue that something needs to be changed. Yes? I'm pretty sure that when you're in one window you're always looking at data sets aligned to one reference sequence. So if you want to switch from mouse to human you would run, I guess, two instances of IGV, one where you load the mouse genome as a reference and one where you load the human genome as a reference. It's a good question. Okay, so let's talk about structural variants. We're going to look at deletions, translocations and inversions. These are a bit more complicated to investigate, and we're going to rely on two main sources of evidence in considering them: inferred insert size and pair orientation. So let's talk about why these are important, first by briefly reviewing how sequencing data is generated. Very briefly, DNA is isolated and then it's fragmented, and those fragments are size selected. Typically all these fragments are run on a gel and people cut out a band around a certain size of interest. So if you're making a library of 350 base pair fragments, you would cut out a very tight band around 350, and then you would continue on with the library prep using just those fragments. If you look at the fragment distribution after your size selection, you'll see that it's roughly a normal distribution with a mean around 350, if that's the band that you cut out, and there will be some fragments that are slightly bigger and slightly smaller. So that's a critical bit of info to keep in mind. Adapters are then added to the ends of each of these fragments. These are denoted by these little orange bits on the end
and reads are generated from each end, as Matei told us. So the actual insert size is the length of this DNA molecule between the two adapters. We had a question previously of how long your reads are versus the fragment. The longer your reads are, the more errors you get, so typically with Illumina sequencing we do 100 base pair paired-end reads. Most data is like that today, or 150 or 125. So usually if you have a 350 base pair fragment or longer, you're going to be able to read the ends, and the middle you'll miss, but because you're generating so many fragments, which are going to be overlapping in nature, you'll be able to tile reads across the whole region. When we align these reads to the reference genome, we can measure what is called the inferred insert size. This is the distance between the starts of the two reads in the alignment. So if there's a consistent discrepancy... let me just go back to this slide for a second. If everything is perfect, and your library size is perfect, and everything's at the mean of 350, then the actual insert size will be 350 and the inferred insert size will be 350. If we see consistent deviations from this scenario, that would be evidence of a deletion or an insertion. When the two reads of a pair map to different chromosomes, that's evidence of a translocation, and obviously the inferred insert size cannot be calculated in that case. Okay, so let's look at a deletion. In such a case, how does the library insert size differ from the inferred insert size? Well, let's look step by step. In our sequenced subject, on the bottom, we have a portion of the genome deleted, so it's gone. The subject's DNA is then fragmented, and this fragment is the right size, 350 base pairs let's say. It's sequenced, and the reads are now aligned to the reference genome, which still contains the portion that's deleted in our subject, so the reads are going to align much further apart than we
would expect. So in the case of a deletion, the inferred insert size is always larger than the expected size based on the library preparation protocol. To find such events visually, IGV has an option to color alignments by insert size. When we right-click on the data panel or the name panel, we get a lot of options for how to color code, sort or group our alignments, and in this case we would color alignments by insert size. This is what we would see in the case of a deletion. This is a 15 kilobase heterozygous deletion. We know it's a deletion because the read pairs have very large insert sizes and are colored red. We also see that there are still reads in the deleted region, so it's not complete, it's not homozygously deleted. One could conclude that it's a heterozygous deletion, although there are other possibilities. Can anyone think of a different possibility? It could be, for instance, that this is a tumor sample and there's heterogeneity, so in half of the cells in the tumor it's a homozygous deletion and in the other half there's no deletion at all, and when you pool the cells together and sequence them you're going to see this exact pattern. So tumor heterogeneity is an additional layer of information that will affect the data. We're not going to have time to talk about it today, but just keep that in mind as another possibility if you're working with cancer data. Similarly to deletions, which are colored red, insertions are colored blue, and reads whose mate maps to another chromosome have a different color code, so these are the colors that would mark a translocation. It's easiest to see by looking at an example. Here's an example of a translocation between chromosomes one and six. Brown is the color code for chromosome six, so we know that all the partners of these chromosome one reads reside on chromosome six, and vice versa: the partners, or the read pairs, of all these reads
on chromosome six actually reside on chromosome one. This also highlights another feature of IGV which I didn't talk about, which is that you can view multiple regions in the same window. So when you have something like this, you can split your view between the two different locations where your reads are mapping. As an additional layer of information, we can use read pair orientation to identify inversions, duplications and translocations, or even more complex rearrangements, so we'll talk about those in a little bit of detail. Let's consider an inversion first, unless you guys have questions about the previous section. Okay, good. So we're going to consider an inversion. The segment of interest, in its normal orientation in the reference genome, is marked from A to B. It's this brown bar. In the subject that we're sequencing, this sequence is inverted, so we have B to A. Now, when we generate libraries from the subject's genomic DNA, some of our sequenced fragments will span the breakpoint. When we align reads to the reference genome, the first read is going to align to one side of the first breakpoint, so in the blue area of the reference genome, but the second read is going to align within the inverted sequence and on the opposite strand. Similarly, on the other end, one read will align as expected beyond the second breakpoint, and its pair is going to align on the opposite strand and within the inverted sequence. We see this pattern when we look at real alignments: both breakpoints will have alignments pointing in the same direction instead of inward-facing as we would expect. There's a color code for this as well, so that we can easily identify these when we look in a genome browser. Teal is the code for read pairs corresponding to the breakpoint on the left of an inversion, and blue is the color code for reads corresponding to the breakpoint on the right of an inversion. So, not surprisingly, you can select color by pair orientation to make these
kinds of anomalies obvious in the browser, and we can see what that would look like here. This is the case where this region of the genome is inverted. We see a drop in coverage where fewer reads can be mapped to the breakpoint, and we see that the reads that can be mapped have this specific pattern of teal or blue, pointing in the same orientation. This is a comprehensive chart of the color options for different kinds of mapping anomalies. It's available on the IGV website, and it always comes in useful when looking at data. We shall see an example of a translocation in the lab. These are reads that are color coded green, so keep these colors in mind. Essentially, reads that are colored green are pointing outwards. We talked about the teal and the blue reads, and then the gray reads are the ones that map in the expected pattern, so those will be most of the reads. Okay, so actually that only took 40 minutes for the lecture, so now we're going to have plenty of time for the lab, I think. We're going to delve into some fascinating examples of all these kinds of scenarios, and I think I'm going to just delve into the lab right away.
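To tie the two structural-variant evidence types together, here is a minimal sketch of how a single read pair might be labeled from its inferred insert size and pair orientation. It is not IGV's implementation: the function name, the argument layout (first read leftmost), the 350 bp expected size, and the 100 bp tolerance are all assumptions made for illustration.

```python
def classify_pair(chrom1, start1, forward1, chrom2, end2, forward2,
                  expected_mean=350, tolerance=100):
    """Label one read pair using insert size and orientation evidence.

    chrom1/start1/forward1 describe the leftmost read (start1 is its leftmost
    aligned base); chrom2/end2/forward2 describe its mate (end2 is the mate's
    rightmost aligned base). expected_mean and tolerance are hypothetical
    library parameters, not values used by any particular tool.
    """
    if chrom1 != chrom2:
        # No insert size can be computed when mates are on different chromosomes.
        return "translocation candidate"
    if forward1 == forward2:
        # Both reads point the same way instead of facing inward.
        return "inversion candidate"
    if not forward1 and forward2:
        # Leftmost read on the reverse strand: the pair points outward.
        return "outward-facing pair"
    inferred = end2 - start1 + 1  # span of the pair on the reference
    if inferred > expected_mean + tolerance:
        return "deletion candidate"
    if inferred < expected_mean - tolerance:
        return "insertion candidate"
    return "expected pattern"


# A pair spanning roughly 15 kb on the reference, as in the deletion example.
print(classify_pair("chr1", 1000, True, "chr1", 16349, False))
```

A single anomalous pair is weak evidence on its own. As in the talk, the signal is a consistent cluster of such pairs, ideally together with a change in coverage.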