Okay. Does the microphone help? Can you guys hear me at the back? Yeah. Okay. Good morning. Welcome to module four. Module four is meant to be a natural progression from the first three modules. Yesterday, through Mike and Aaron's presentations and lab work, you got the chance to take reads, align them to a reference genome, call single nucleotide polymorphisms, and call structural variants. Now we're going to become experts on how to use visualization tools to look at these calls, and in particular to see whether they're actually true variants and to assess their quality. The learning objectives of this module are, first, to appreciate the spectrum of visualization tools in genomics. We're going to be focusing on genome browsers, but I want you guys to see the bigger picture, in the context of all the other genome visualization tools that you could possibly throw at your problem. Second, you guys are going to become gurus at looking at single nucleotide polymorphisms and structural variants. I know you ended a bit short yesterday with structural variants, but I promise you'll almost be sick of looking at structural variants by the end of the labs. And then, if there's time, I'm going to talk about looking beyond genome browsers and where we see the future of visualization tools going. Okay, so part one, before the break: we're going to do a short lecture on genome browsers, and then the lab is going to show you visualization of SNPs and structural variants. The second part, again if we have time, is about search engines. If we run out of time, we'll just move part one into part two. Okay. I want to start by further motivating the need to visualize your data.
Visualization often seems like something you do at the end, but I really want to convince you that it's a process that is really beneficial to your analysis, and that it's something computers can't do very well. The first example, which is shown at every data visualization conference you ever go to, is called Anscombe's Quartet. What you see here on the left are four sets of data. They're x, y pairs, and they all have the same mean and variance. What you see on the right is a visualization of those four data sets, and what becomes very clear just by quickly looking at it is that the data sets are completely different. They have different trends, and just by looking at a picture of them you can easily tell that they're different, even though the simple statistics are identical. So understanding trends is one thing that visualization is really good at, with a minimal amount of work. The second thing is the ability to identify outliers, and this is really important in the context of debugging your software. Your visual system pre-attentively processes this image, which means that you detect that red dot unconsciously. It takes a minimal amount of effort, and I want to prove to you that this happens entirely unconsciously. There's going to be an image that flashes on the screen. It's going to be very quick, but you will be able to identify the outlier in the image. What was it? Red dot. So you actually have a set of different pre-attentive features that we can take advantage of, which includes color, size, and shape. If we use these different parameters in our visualizations, you will be able to detect things like structural variants and single nucleotide polymorphisms really quickly.
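Going back to the Anscombe's Quartet example, here is a minimal sketch of the point being made: the summary statistics of the four data sets come out essentially the same, so only a plot reveals how different they are. The values are the published quartet from Anscombe (1973); the helper name `summarize` and the rounding to two decimals are choices made for this illustration only.

```python
from statistics import mean, variance

# Anscombe's quartet: sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return (mean x, sample variance x, mean y, sample variance y), rounded."""
    stats = (mean(xs), variance(xs), mean(ys), variance(ys))
    return tuple(round(value, 2) for value in stats)


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, stats in summaries.items():
    print(name, stats)
```

The four printed rows agree to within rounding, which is exactly why the statistics alone cannot tell the data sets apart.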
So the point is that you are a relatively low-cost, high-performance sense-maker for identifying patterns in your data set, and a debugger for identifying issues with your data set. And this is compared to the cost of writing a script that would go through and otherwise identify those patterns. Okay, so let's jump into visualization tools. There is a whole suite of different visualization tools that you could use to look at your data, and the question I often get is, which one do I use? There are over 40 different genome browsers alone. It really depends on the task at hand, the kind and size of your data, and data privacy. We'll get to genome browsers in a second, but I consider the genome browser the sledgehammer of visualization tools, and there are certain problems that you want to solve with a wrench, for example. So there are other tools, which we won't discuss in detail today. These are just two such tools, which came out of Martin Krzywinski's lab. You've actually seen these yesterday. This is a Circos plot, which helps you identify structural variants that range over a distance. Here we have the chromosomes laid out along a circular path, and we have connections going between them. The hive plot is a tool built by the same guy, and it's good for identifying trends in clusters or groups of entities. If you've seen the hairballs you get when you're looking at a protein-protein interaction network, this is a way of clustering and connecting the various entities. So if you're doing that kind of analysis, as I said, it depends on the task at hand. That kind of analysis is much better suited to a Circos plot than to a genome browser, for example. But in the context of this workshop we're, again, going to be focusing on genome browsers, and just to make the disclaimer, I'm the developer of the tool on the right. Just so that's clear.
But I would recommend these two tools if you're going to be looking at high-throughput sequencing data. The reason is that they were made in the era when high-throughput sequencing data was just becoming popular, so they were built around that kind of data. They're especially good for looking at previously identified variants, like the ones that you guys called yesterday. They can handle large BAM files, as you've already seen, stored either locally or on a web server. And one advantage is that you have implicit data privacy. Because these actually run on your computer and not in the cloud, you can keep all your data local, and depending on your data privacy restrictions, that could make or break your decision. But I don't want you to be narrow-minded. You could also try the UCSC Genome Browser, and this new genome browser called Trackster. The UCSC Genome Browser is probably the most traditional and widely used genome browser, and it has recently been retrofitted to handle your own BAM files. So you should actually look at this in UCSC. Trackster, I hope, is part of the Galaxy lab today. It's really cool. It integrates with Galaxy in the sense that you can fine-tune parameters for programs that you're running on the cloud. You just run the analysis on your local window and it shows you the results. Once you're satisfied with those results, you can then dispatch it onto the cloud and it will run on your whole data set. So you can do visualization and analysis to spot-check before you actually commit to running a long process. The genome browser that you guys are going to be working with today is Savant. It's the tool that I've been developing. It's particularly designed for high-throughput sequencing data and to emphasize single nucleotide and structural variants. What we tried to do is make identifying those events pre-attentive.
On the left, you see an image of a SNP. I'll explain what these are in an actual demo, but this is a coverage distribution split by strand. These are reads coming from the positive strand and the negative strand, and this is just a plot of coverage. What you see here on these plots is the proportion of reads which mismatch with respect to the reference. So a SNP is evidenced by just a single line. We also came up with a new visualization technique for mate pairs. You guys learned about the paired-end mapping approach to identifying structural variants, and what you see here is an event that's a homozygous or heterozygous deletion. I'm going to explain how we encoded the mate pairs in a second. Okay, so it's time for some more DNA gymnastics, in the words of Aaron. Aaron covered the processes of finding structural variants. Two of them are depth of coverage and paired-end mapping. Depth of coverage is a pretty simple technique. If there are two copies in the genome that you're sequencing and only one in the reference, you're going to get an overabundance of reads, and you'll see an overabundance in the pileup when you're looking at that region of the genome. Conversely, if there's been a deletion, so that individual only has one copy versus two, you're going to see a relative decrease in the coverage. Another approach is this paired-end mapping technique, and this is the one that I'm going to go through. In the case of a small insertion, where your library is such that mate pairs actually span that insertion, what you'll get when you perform your mappings is a cluster of discordant mate pairs whose mapped insert sizes are less than what you'd expect. So insertions get smaller. Remember that. Large insertions. In the case where you have a large insertion, such that the mate pair doesn't span the entire length, you're not going to get anything which crosses that breakpoint.
And so here, basically, all the mate pairs coming from the left of that breakpoint map to the left, and everything to the right maps to the right, but you get nothing crossing. So you'll just see a dip right at that particular breakpoint. That's the effect. Any questions so far? I suggest that you use these slides as a reference. When you're looking at these instances in a genome browser, use this as a reference. And if it helps, draw it out on paper, like Aaron was saying. It really makes sense once you draw all these out and try to map everything. In the case of a deletion, when this mate pair in the middle spans the breakpoint, the two ends map to either side of the deletion, and you're going to get a cluster of discordant mate pairs whose mapped distance is too large. So the only trick is this: for insertions that span the breakpoint, the mate pairs are mapped closer together, and for deletions, they're mapped farther apart. Okay, so inversions and tandem duplications, I won't go through in detail. Inversions and tandem duplications are very similar. What you get in these cases is that one of the reads actually maps in a different orientation with respect to the order in which they were sequenced. And it's the same with a tandem duplication. What happens here is that the first read actually maps here and the second one actually maps behind it. So you get a relative discordancy with respect to the order. We're getting some feedback from the microphone. Test, test. Test, one, two, three. Okay, great, that's a lot better. Sorry about that. Thank you.
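The paired-end signatures described above can be summarized in a short sketch. This is an illustration under stated assumptions, not the method of any particular caller or browser: it assumes a library where a concordant pair has the left read on the `+` strand and the right read on the `-` strand, and the `MatePair` class, the `classify` function, and the 200/800 base-pair bounds are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class MatePair:
    left_start: int    # leftmost mapped coordinate of the pair
    right_end: int     # rightmost mapped coordinate of the pair
    left_strand: str   # '+' or '-' for the read that maps further left
    right_strand: str  # '+' or '-' for the read that maps further right


def classify(pair, min_span=200, max_span=800):
    """Guess which event a single mate pair is consistent with.

    min_span and max_span are assumed bounds on a concordant mapped
    distance; in practice they come from your library's insert sizes.
    """
    # Both reads on the same strand: one read flipped orientation.
    if pair.left_strand == pair.right_strand:
        return "inversion"
    # Expected order is reversed: the second read maps behind the first.
    if (pair.left_strand, pair.right_strand) == ("-", "+"):
        return "tandem duplication"
    span = pair.right_end - pair.left_start
    if span > max_span:
        return "deletion"   # deletions get bigger
    if span < min_span:
        return "insertion"  # small insertions get smaller
    return "concordant"


print(classify(MatePair(1000, 3000, "+", "-")))
```

A single discordant pair is weak evidence; the signal in the slides is a cluster of pairs that all tell the same story. Large insertions are not covered here, since their signature is the absence of pairs crossing the breakpoint rather than a property of any one pair.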
Savant has this way of depicting mate pairs that makes structural variants pre-attentive. You shouldn't have to do much conscious work to be able to say, aha, that's a structural variant. Your job is then to interpret what kind of event it was, based on the order and size of the mate pairs that are being mapped. What we do is connect the two endpoints, so the first mapping and the second mapping, by an arc. The height of that arc corresponds to how far apart those reads were mapped. If the reads map in a different orientation than we expect, we also color them differently, so they're going to pop out in terms of color as well. We'll do that video. Okay. So, some examples. Here we're going to take a look at single nucleotide polymorphisms. This is much the same as you've seen in IGV. Here we have reads piled up at the location that they've mapped to. They're colored light blue if they map to the forward strand and dark blue if they map to the negative strand. What you see scattered across these are mismatches with respect to the reference genome. In the case where we have a SNP, there's just a column of consistently mismatched bases with respect to the reference. So that's pretty straightforward, and you've already seen this before. We have two special modes. The first is called SNP mode. This just abstracts away all the information and clutter that's caused by drawing individual reads. Here we just show that coverage plot, and where you see these mismatches, they're just stacked up. So here, again, you see that column just a little bit clearer. And in strand SNP mode, we just split that coverage plot into positive and negative strand. This is useful for identifying strand bias. Mike brought this up yesterday. Typically, what we've seen is that there's an erroneous mismatch that's introduced in one of the reads and that gets amplified through PCR.
And so what you see is that, on one strand only, you'll get one mismatch that propagates throughout your data. Here you would be able to easily pick out that artifact, whereas in this case we've got support from reads coming from both strands. So that's a good sign. Okay. So IGV has a way of displaying mate pairs. Here we have a line which connects two reads that are paired. And remember, deletions get bigger, insertions tend to get smaller. So here this is evidence for a homozygous deletion, where we have a cluster of discordant mate pairs whose mapped distance is further apart than what's expected from the library. This works really well for homozygous deletions. In just a few seconds, we'll show the heterozygous case. What we found is that in the heterozygous case you actually get a set of discordant mate pairs which are too far apart, but you also get the normal ones, which kind of clutter your view. So here, see, some of these are overextended, but the ones at the top are normal. There's not a clear distinction between those two kinds of mate pairs. So now we're going to look at the corresponding arc visualization of those two events. This is the homozygous event that we were just looking at. Here we have a pretty clear set of discordant mate pairs. Again, there's no information coming from the center, so this is support for the deletion being homozygous, and we can identify the breakpoints here. Any questions about this? This is parameterized, so you have to specify to the browser what you consider discordant. Again, it has to be based on knowledge of your library. So you would know that your library has an insert size of 500 base pairs, and you would say to Savant, anything larger than 800, for example, I want to color as being red. So here's a heterozygous case.
It's unfortunate that this is actually a very large event, but imagine that this set of mate pairs actually goes off into the distance and comes down at some point. Because it's so large we don't actually draw the whole arc, but the concept is the same. So this is a heterozygous deletion that starts here. It might not be as obvious on the projector, but does it look like this region is more dense than the right side of that breakpoint? There should be a relative thinning out, to approximately half, because of that heterozygous event. So we're going to jump to the mate, and it should switch. Inside this event we have a relative thinning out of our coverage. This is the right side of the breakpoint, so here we're thin and there we're a little bit more dense, and you can also validate this by looking at depth of coverage. So depth of coverage and paired-end mapping are complementary, in a sense. If you looked at the coverage, there would be about half as much in that region. It's the number of pairs. So here each line represents a connection between pairs. This is just for fun, just to know the resolution and how good it is. Say you had two alleles that differ from the actual reference genome, so you have an individual with a deletion at basically a similar location on each one, but not the exact same one. Where they perfectly match the reference genome, you would have that nice dark area, kind of like that. Well, where that red arc starts, could you then have a section with an intermediate density, then a really faint density, then intermediate, and then back to dark? Could you actually pick that up with something like this? Yeah, you could. And I'm not sure if there are any in this data set. I only looked through a portion of them.
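The "about half as much coverage" reasoning, including the intermediate and faint densities raised in the question, can be written down as a small sketch. It assumes a diploid genome and compares a region against its flanks; the function names, the 0.2 tolerance, and the example depths are all hypothetical and only meant to show the arithmetic.

```python
from statistics import mean


def relative_coverage(region_depths, flank_depths):
    """Ratio of mean depth inside a region to mean depth in its flanks."""
    return mean(region_depths) / mean(flank_depths)


def interpret(ratio, tolerance=0.2):
    """Map a coverage ratio to the event it is consistent with (diploid assumption)."""
    if ratio < tolerance:
        return "homozygous deletion"    # almost nothing left
    if abs(ratio - 0.5) <= tolerance:
        return "heterozygous deletion"  # about half, one copy lost
    if abs(ratio - 1.0) <= tolerance:
        return "no copy-number change"
    if ratio >= 1.5 - tolerance:
        return "copy-number gain"       # overabundance of reads
    return "ambiguous"


# Made-up per-base depths: flanks at about 30x, region at about 15x.
flank = [30, 32, 28, 30]
region = [15, 16, 14, 15]
print(interpret(relative_coverage(region, flank)))
```

Real coverage is noisy, so a ratio like this is best read alongside the paired-end evidence rather than on its own, which is the sense in which the two approaches are complementary.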
You're actually going to get to look at all the structural variants that were called as part of Aaron's lab, and there are some hairy events. It's important to recognize that these events tend to happen together. They're not discrete events; they can happen in combination. We talked about the shattering, so they get pretty hairy sometimes. Okay. So I think that's about it. So it's lab time. The first link is using Savant to identify single nucleotide polymorphisms and structural variants. So that's this one. Do you want me to come up? Sorry? It's about a year and a half old. You can listen how. Yeah, so we found that every time we tried to validate calls, it would typically be because of strand bias. We implemented that as sort of a 3D feature for ourselves. We can see a new cluster out there. Okay. Yeah, so it's all on the wiki. The first thing you guys are going to do is load a BAM file. The URLs are all posted there. You're going to load the set of read alignments for all of chromosome one for the individual that we've been looking at in the previous modules. Then you're going to look at the structural variants that we called yesterday as part of Aaron's lab. What we're going to do is import those as a set of bookmarks, and we can use that as a basis for navigating to all these positions. And then my hope is that you guys will switch to this arc mode and be able to quickly identify the signatures. I've annotated each of the structural variant calls with what the call was, so whether it's an inversion or a deletion. You can cross-reference that with what you see and just make sense of it. And then last, we're going to take a look at single nucleotide polymorphisms. There was a question about what happens with translocations, and we can use the arc mode to visualize translocations. What we do is draw an arc, but it doesn't go anywhere; it just has an arrow.
And then if you mouse over that arrow it'll tell you the location on the other chromosome that it goes to. And you can just mouse over and click to jump to that location.
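To make the arc encoding concrete, here is a minimal sketch of how a mate pair could be turned into a drawable glyph: height grows with mapped distance, color flags discordance, very large events are clipped, and a mate on another chromosome becomes an arrow carrying its target location. This is not Savant's actual implementation. The function name `arc_glyph`, the color labels, the 800 base-pair threshold (borrowed from the example given in the talk), and the clipping distance are assumptions made for illustration.

```python
def arc_glyph(chrom1, pos1, strand1, chrom2, pos2, strand2,
              max_concordant=800, track_height=100.0, max_drawn_distance=5000):
    """Describe how one mate pair would be drawn in an arc-style track."""
    # Mate on another chromosome: no arc to draw, just an arrow with a target.
    if chrom1 != chrom2:
        return {"shape": "arrow", "target": f"{chrom2}:{pos2}", "color": "translocation"}

    distance = abs(pos2 - pos1)
    # Height is proportional to mapped distance, capped so huge events still fit.
    height = track_height * min(distance, max_drawn_distance) / max_drawn_distance

    # Order the strands by mapped position to check the expected +/- layout.
    left, right = (strand1, strand2) if pos1 <= pos2 else (strand2, strand1)
    if (left, right) != ("+", "-"):
        color = "unexpected_orientation"
    elif distance > max_concordant:
        color = "too_far_apart"
    else:
        color = "normal"

    return {
        "shape": "arc",
        "start": min(pos1, pos2),
        "end": max(pos1, pos2),
        "height": height,
        "color": color,
        "clipped": distance > max_drawn_distance,
    }


print(arc_glyph("chr1", 1000, "+", "chr1", 1500, "-"))
print(arc_glyph("chr1", 1000, "+", "chr5", 200, "-"))
```

Keeping the target location on the arrow glyph is what would let a viewer show it on mouse-over and jump to it on click.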