Okay. So before we get started with the lab for module two, I just wanted to go over a really quick presentation introducing the software we're going to be using, IGV. In reality, we could probably give an entire day's workshop on how to use IGV and the different things we use it for, but we only have an hour, so I just want to give you the basics before you get started on your own. So Richard mentioned a lot of different file types. They all seem to be, for the most part, pretty text-heavy, text-based. But what do we do with them, and how do we interpret them? I don't know about you, but I'm definitely very much a visual person. Text doesn't mean a whole lot to me; it's a lot easier if I can actually look and see what's going on. So one of the things we can do is visualize some of this genomic information. We will be using IGV for this lab. It's one of many genome browser tools available. The UCSC Genome Browser is a really popular web-based one, and IGV is a pretty popular downloadable one, and it supports many different file types. You can look at DNA alignments, RNA alignments, DNA methylation data, ChIP-seq data, VCF files and many more, but I think most commonly IGV is used to view DNA and RNA alignments. And these files can either be stored on your computer, or you can pull them from the cloud if they're hosted online. So once you launch IGV, you should see a screen that looks like this. The one thing you want to do is make sure you have the correct reference genome selected. This is a pretty old screenshot here. At the very least you should be using hg19 if you're working with human data. If you don't have the correct reference genome, everything is going to look very wrong. Your alignments are going to look terrible. From here you can then load your data, again, as I said, from a file on your computer or from a URL or server. This is an example of some methylation data, I believe.
And here is a breakdown of what everything is. At the top you have your menu bar, like you do with pretty much any app. The toolbar is along here. This is where you select the reference genome, and this is where you zoom in and out of the different genomic regions. The search bar allows you to type in genomic coordinates to quickly navigate to regions of interest. And there's this genome ruler underneath, which indicates which region of the genome you are currently viewing. When you first start, it'll give you an overview of all of the chromosomes. When you zoom into a specific region, you'll get an ideogram of that specific chromosome. And then you'll see the tracks. These are the actual data that you've loaded into the data panel, which takes up the bulk of the screen. Each data set gets a separate track in the middle, and the track names are indicated in the left-hand panel. There is an optional attributes panel. We're not going to be using it today, but it is available if you're looking at a bunch of different samples at once and you want to distinguish them by certain phenotypic attributes or metadata, such as whether a sample is tumor or normal. And then there's this genome features track along the bottom. It has different annotations of the genome, such as the locations of RefSeq genes. Once your file is loaded, if you're loading a BAM file, you'll also see a coverage track pop up for each BAM file that you load. This shows how many reads are covering any particular region that you're looking at. You're going to have to zoom in to see the actual alignments. That's because BAM files can be extremely large, containing several gigabytes of data, and trying to load all of that at once is going to freeze your computer and cause problems. So suppose I showed you this screen here and asked how many variants are in this screenshot.
How can you tell? How quickly can you see them? Can anyone give me a number? All right, I'm not going to make you do that. How about now, how many variants can you see? Four? There should be five: there are four C's and, oh yes, one other. Yeah. So it's a lot easier to identify anomalies when only the anomalies are highlighted, and IGV makes really good use of color coding in order to make viewing variants easier on you. With the default settings, each read is a gray arrow, and the reads are aligned to the reference genome. The reference genome is displayed at the bottom, and each base gets its own color. If you zoom in far enough, you'll actually see the letters in both the reference genome and in the reads as well, or in the mismatched reads at least. And in the reads, the bases are gray if they match the reference, so they don't show up, but they're colored if they don't match. It might be a bit hard to see in these slides, but the mismatched bases also have different intensities depending on the base quality score, as determined by the sequencing technology. The solid colors have a very high base quality, whereas these really faint colors are very low quality and are probably a sequencing error or an artifact as opposed to an actual variant. Coloring the reads themselves can also help you identify artifacts and structural variants. Here the reads are no longer gray; they are red and blue for forward and reverse orientation. So look at this position here. If these reads were all gray, you would say, okay, that's a pretty good indication that there is a single nucleotide variant here. But if you look at the orientation, the variant only occurs on the reverse reads, so it will make you question the validity of this particular SNP, and you can maybe rule out certain false positives this way.
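As a rough sketch of that strand check, the Python below counts how many reads support a variant base on each strand and flags the case where all of the support comes from one strand. The pileup data, the function names and the `min_reads` cutoff are all made up for illustration; this is not how IGV or any particular variant caller does it.

```python
from collections import Counter


def strand_support(observations, alt_base):
    """Count reads carrying alt_base on the forward and reverse strands.

    observations: iterable of (base, strand) tuples, strand is "+" or "-".
    Returns (forward_count, reverse_count).
    """
    counts = Counter(strand for base, strand in observations if base == alt_base)
    return counts["+"], counts["-"]


def looks_strand_biased(forward, reverse, min_reads=4):
    """Flag a variant whose support comes from only one strand.

    min_reads is a hypothetical cutoff: below it there are too few
    supporting reads to say anything either way.
    """
    if forward + reverse < min_reads:
        return False
    return forward == 0 or reverse == 0


# Hypothetical pileup at one position: reference base A, candidate variant T.
pileup = [
    ("A", "+"), ("A", "+"), ("A", "+"),
    ("T", "-"), ("T", "-"), ("T", "-"), ("T", "-"),
    ("A", "-"),
]

fwd, rev = strand_support(pileup, "T")
print(fwd, rev, looks_strand_biased(fwd, rev))  # prints: 0 4 True
```

A variant flagged this way is not necessarily false; it is just a candidate worth a closer look in the browser.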
So, taking a step back as a refresher: during paired-end sequencing library preparation, DNA is sheared into fragments, and only the ends of each fragment are actually sequenced, from one end going in and from the other end coming in the other direction. So we expect the read pairs to be oriented left-right when they're mapped back to the reference genome. If the reads are in other orientations, it doesn't necessarily mean that something has gone horribly wrong during sequencing. Instead, it suggests that there's a structural variant in the sample that causes the read orientations to look a bit weird when those reads are mapped back to the reference genome. Basically, the sequence of that particular segment is different in your sample than it is in the reference genome. Left-left and right-right pairs are seen in inversions and are colored teal and blue, respectively. I don't have time to go into the specific details of how this all works, but the IGV user manual online gives a pretty thorough explanation. If you have reads facing the opposite way, going right-left, this might be indicative of a tandem duplication, as depicted here, or a translocation within the same chromosome. So, in this example, there is an inversion. We've colored the reads by pair orientation, and we can see there are a bunch of reads in unexpected orientations, and their mates as well. You can also look at the coverage track and see that there is a drop in coverage around the same position. So you can use multiple tracks to confirm or validate what might be going on in a particular sample. Also, during library prep, the fragments generally undergo size selection to try to select fragments that are roughly the same size. That means your read pairs have an expected insert size. The insert is the distance from the end of the first read to the end of the second read.
So, when you map your reads back to the reference, a lot of aligners will use this expected insert size to try to find the best mapping if there are multiple places a read can map. But we can also use information about read pairs with insert sizes that are either smaller or larger than expected in order to detect structural variants. If a fragment originates from a region with a large deletion or insertion, the inferred insert size, that is, the insert size when it's mapped back to the reference genome, changes. It's going to get larger if it's a deletion or smaller if it's an insertion, and hopefully you can see why in these figures here. Read pairs with a larger than expected insert size are colored red, and ones with a smaller than expected insert size are colored blue. As well, inter-chromosomal rearrangements can be detected when a read maps to a different chromosome from its mate. We expect both reads in a pair to map to the same chromosome, at least, but if there's a translocation, they can often map to different chromosomes. When this happens, the reads are colored according to which chromosome their mate is aligned to. I'll show you an example of this in a second. First, an example of a large deletion. You can see that the read pairs are colored red because these particular pairs have an insert size that is larger than expected. There's also a drop in the coverage track again. So there are fewer reads covering this region, and we have these larger than expected insert sizes, and we can be fairly confident that there's a deletion in this area. And here is an example of a rearrangement, or a translocation.
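As a rough sketch of the rules described so far (pair orientation, insert size, and mates on different chromosomes), the Python below labels a read pair with the kind of event it might hint at. The `ReadPair` class, the label strings and the 200 to 600 bp expected insert range are all hypothetical; real tools estimate the expected range from the data, and a single odd pair is never proof of a structural variant.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom1: str        # chromosome of the first read
    chrom2: str        # chromosome of its mate
    orientation: str   # "LR" (expected), "LL", "RR" or "RL"
    insert_size: int   # inferred from the mapping; ignored if chromosomes differ


def classify_pair(pair, min_expected=200, max_expected=600):
    """Return a hedged label for what an unusual read pair might indicate.

    min_expected and max_expected are made-up bounds for illustration.
    """
    if pair.chrom1 != pair.chrom2:
        # Insert size is undefined when mates are on different chromosomes.
        return f"possible inter-chromosomal rearrangement (mate on {pair.chrom2})"
    if pair.orientation in ("LL", "RR"):
        return "possible inversion"
    if pair.orientation == "RL":
        return "possible tandem duplication or intra-chromosomal translocation"
    if pair.insert_size > max_expected:
        return "possible deletion (insert larger than expected)"
    if pair.insert_size < min_expected:
        return "possible insertion (insert smaller than expected)"
    return "normal"


examples = [
    ReadPair("chr1", "chr1", "LR", 350),
    ReadPair("chr1", "chr1", "LR", 5000),
    ReadPair("chr1", "chr1", "LL", 400),
    ReadPair("chr1", "chr6", "LR", 0),
]
for pair in examples:
    print(classify_pair(pair))
```

The chromosome check comes first because an insert size cannot be measured across chromosomes, and orientation is checked before size so that an inverted pair is not mislabeled as a deletion.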
We're actually looking at two different regions at the same time within IGV, in a split screen, and we're looking at two different samples, so this is a tumor sample up top and a normal sample underneath. On chromosome one, some of the reads in the tumor are colored orange. Sorry, if I go back a step: if reads are mapped to different chromosomes from their mates, that's an undefined insert size. You can't quantify that distance because they're on different chromosomes. And the legend here will tell you which chromosome the mate is mapped to, so orange here means chromosome six. And sure enough, if we look at chromosome six and find the reads that correspond to these reads here, those ones are also colored, blue, indicating that their mates are mapped to chromosome one. So what do we have to keep in mind when viewing alignments from cancer samples? I think the main thing is that for true germline variants, you would expect the frequency of a variant allele to be either 0.5 or 1. So roughly half of the reads carry the variant if the sample is heterozygous at this position, and nearly all the reads carry the variant if it's homozygous. But somatic variants have a variable frequency. It can be very low or very high and anywhere in between due to tumor heterogeneity, as Trevor talked about in his intro lecture. Sometimes you need really deep coverage of your sample in order to identify these somatic variants if they occur at a very low frequency. And it's not always easy to tell if a variant is somatic or if it's actually a sequencing artifact. So that's why we don't really do this by eye; we use cancer-specific software to confirm these variants. You'll use a few of these tools tomorrow and the next day to actually identify variants from cancer data.
But it can be helpful to view potential variants in a browser like IGV, either to confirm a variant or to identify signs that a variant might be an artifact or a false positive.
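To make the allele frequency point concrete, here is a minimal Python sketch that computes a variant allele frequency from read counts and compares it with the germline expectations of roughly 0.5 and 1. The function names, the read counts and the `tolerance` of 0.1 are assumptions chosen for illustration. A frequency near 0.5 does not rule out a somatic variant, so this only describes what a number is consistent with; it is not a substitute for the cancer-specific callers.

```python
def variant_allele_frequency(alt_reads, total_reads):
    """Fraction of reads at a position that carry the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def describe_vaf(vaf, tolerance=0.1):
    """Compare a VAF with germline expectations (about 0.5 or about 1).

    tolerance is a hypothetical margin, not a recommended value.
    """
    if abs(vaf - 0.5) <= tolerance:
        return "consistent with a heterozygous germline variant"
    if vaf >= 1 - tolerance:
        return "consistent with a homozygous germline variant"
    return "not a typical germline frequency (possible somatic variant or artifact)"


# Hypothetical (variant reads, total reads) counts at three positions.
for alt, total in [(48, 100), (97, 100), (6, 100)]:
    vaf = variant_allele_frequency(alt, total)
    print(f"{vaf:.2f}: {describe_vaf(vaf)}")
```

With only 6 variant reads out of 100, the last case is the kind of low-frequency call where deep coverage and a look at the reads in IGV both help.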