So in this module, we're going to talk about data visualization, and especially about genome browsers for looking at your sequencing data. The goals are to appreciate the different data visualization tools in genomics, to know when to use which tool and what the benefits of one tool are versus another, or in which circumstances you would prefer one over the other, to gain more experience with a genome browser, and ideally to become an expert in variant inspection. We're going to look at the examples of single nucleotide and structural variants. Feel free to interrupt if you have any questions.
So the first part will be about visualization tools, genome browsers, and an overview of IGV. And then we're going to look at single nucleotide polymorphisms and structural variants in detail using IGV.
What about visualization tools? First, why do we want to visualize our data? It's actually an important thing to do when you're analyzing data. You will run a lot of tests and a lot of statistics, but you always have to actually look at your data. It's going to be really helpful. A good example of this is a dataset produced by a statistician in the 70s, Anscombe. He created four datasets that have exactly the same basic statistics. The average of the Y values is always 7.5, and the average of the X values is always 9. You have the same variance, the same correlation and the same linear regression. However, if you plot them, it becomes obvious that those datasets are not the same. They all have a relationship between X and Y, pretty much a linear relationship for the first one on the top left. However, the one on the top right is not really linear, and you would need a more general regression to model it properly. The bottom left is a very good linear relationship plus one outlier, so you would need a robust regression for that.
And the last one is an example of how an outlier can give you a good correlation where there shouldn't be one. So they all have a correlation of about 0.8, but you can observe that the data are pretty different, and the basic statistics wouldn't give you a clue about that. Another dataset, produced by a group in Toronto, is the Datasaurus Dozen. They started from an image of a dinosaur, and all these 12 images have the same X and Y means, standard deviations and correlation, but obviously they are different patterns. It was the same idea. They wanted to show that statistics are really good, but visualization is important as well.
So, our visual processing. We have two types, the pre-attentive and the attentive. The pre-attentive type is the cognitive operation that can be performed prior to focusing attention on any particular region of an image. That would be what you use to detect Waldo on the left part of the image. Obviously, you can find him very easily. The attentive visual processing, however, is what you're going to use to find Waldo on the beach. We want to use the pre-attentive one to detect outliers. Encoded properly using pre-attentive attributes, outliers can really be picked up by your visual system. So the idea is to display your data in the best way so that you can catch outliers. We can claim that the visual system is a low-cost, high-performance sense maker or debugger for identifying outliers, as long as you display the data properly.
So we're going to use tools to display the sequencing data you're going to be working on, or are already working on. There are a lot of different genome browsers; over 40 of them are available. Which one you will want to use depends on the task you have to do, what kind of data you have, and the size of your dataset.
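Coming back to the Anscombe example for a moment, the claim that the four datasets share the same basic statistics can be checked directly. The block below is a minimal sketch in Python, using the published values of Anscombe's quartet; the helper name `summary` is just an illustrative choice, not something from the slides.

```python
from statistics import mean, variance

# Anscombe's quartet (Anscombe, 1973). Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summary(xs, ys):
    """Return the basic statistics discussed in the talk for one dataset."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx  # least-squares slope of y on x
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": variance(xs),          # sample variance
        "r": sxy / (sxx * syy) ** 0.5,  # Pearson correlation
        "slope": slope,
        "intercept": my - slope * mx,
    }


stats = {name: summary(xs, ys) for name, (xs, ys) in quartet.items()}
for name, s in stats.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

The printed summaries agree to about two decimal places across the four sets, which is exactly why only a plot reveals how different they are.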
And it depends on whether you have any privacy issues with the dataset: do you have to work only locally, or can you put your data on a cloud to be able to use a cloud-based server, et cetera? So I'm going to mention a few genome browsers, such as IGV, which we're going to look at in detail and use in the practical just after. UCSC has a genome browser; probably all of you have already used it to go online and browse the different genomes. There is one that is part of Galaxy, Trackster. And the Savant genome browser is also used a lot, and it has the particularity of being able to annotate your variants as part of the function of the browser. With one like IGV you're just looking at the data; you will not be able to annotate a particular variant, for example.
So let's talk about IGV. I just want to mention that all the slides with the Broad Institute logo at the top are borrowed from the Broad website for educational purposes. That's where IGV is developed. So IGV is the Integrative Genomics Viewer. It's a high-performance visualization tool for interactive exploration of large, integrated genomic datasets, and it supports a wide variety of data types, including array-based and next-generation sequencing data and genomic annotations. It's a desktop application for interactive and visual exploration of integrated genomic data. You can load a large range of data types and datasets: epigenomic data, microarray data, sequencing data, whole genome or RNA-seq, copy number, etc. So with IGV you can explore large genomic datasets with an intuitive and easy-to-use interface, and integrate multiple types of data with clinical and other sample information.
We'll see how this is displayed. You can view data from multiple sources, which could be local, remote or cloud-based, and you can also perform some automated tasks, like loading a script that says go to this position, change this option, take a screenshot of the data that's displayed, go to the next position, etc. We will have an example of this at the end of the practical. So the data you use in IGV could come from a server, from a local file, from a public dataset, etc. It really depends on which data you want to use, and you can load it from different places.
So the basics of IGV: you launch IGV, you select your reference genome, you load the data of interest, and you navigate through the data to look at a particular position. In general you would run your analysis to detect SNVs or structural variants, you will have a list of them, and so you will go to each position to check that the reads really support what has just been detected by the program you ran. You're going to use these programs tomorrow to detect SNVs and structural variants in the practical and the lab.
You've all been on the IGV page to download it, I believe, so that's what you must have done to get your version of IGV. When you open IGV you have the screen that's up here. The first thing to do is to select the genome from the top left corner. If you mapped your reads to a particular reference genome like hg19 and you keep the IGV default, hg18, you will get a rainbow of colors because everything is a mismatch. That's a clue that something is wrong with the genome you loaded. Then you load the data files. This is an example of how to load files from the server, so publicly available files that are part of IGV for doing tutorials or tests. But in our practical we will load our own cancer file.
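The scripted workflow just described (go to a position, take a screenshot, move on) can be sketched as a small Python helper that writes an IGV batch script. This is only an illustration under stated assumptions: the function name `build_igv_batch`, the file paths and the example regions are hypothetical, while `new`, `genome`, `load`, `snapshotDirectory`, `goto`, `snapshot` and `exit` are standard IGV batch commands.

```python
def build_igv_batch(genome, bam_path, snapshot_dir, regions):
    """Build the text of an IGV batch script that snapshots each region.

    regions is a list of (chromosome, start, end) tuples, for example the
    positions of variants reported by a caller.
    """
    lines = [
        "new",                                # start from a clean session
        f"genome {genome}",                   # must match the genome the reads were aligned to
        f"load {bam_path}",
        f"snapshotDirectory {snapshot_dir}",
    ]
    for chrom, start, end in regions:
        lines.append(f"goto {chrom}:{start}-{end}")
        lines.append(f"snapshot {chrom}_{start}_{end}.png")
    lines.append("exit")
    return "\n".join(lines) + "\n"


# Hypothetical inputs, for illustration only.
script = build_igv_batch(
    genome="hg19",
    bam_path="/data/sample.bam",
    snapshot_dir="/data/snapshots",
    regions=[("chr1", 900, 1100), ("chr6", 5000, 5400)],
)
print(script)
```

The resulting text can be saved to a file and run from within IGV as a batch script; the exact menu entry may differ between IGV versions.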
When you've loaded the data you will have a screen like this, and I will describe its different components. At the top you have the menu, as you would have in any tool. Then there is the toolbar; in particular, you have a box in the middle where you can type the name of a gene of interest or a region that you want to go to, and other options on the right. Below that is the genome ruler, where you can select more precisely where you want to go. Then you have the different tracks with the data. On the left side you have the name of the file you loaded. Then, if you loaded some attributes, they are displayed as columns. So if you have different patients, it could be the sex of the patient, or the subtype of the cancer, or other clinical information about the data. At the bottom you have the genome features. The basic thing is just to show the genes. For example, you can load the UCSC genes or the Ensembl ones, or a repetitive element track as well.
So different file formats can be used. The file format defines the track type, and the track type determines how the data is going to be displayed. IGV automatically recognizes what type of data you're loading by looking at the file: if it's .bam, it knows it's a BAM, so by default it displays it in a particular way. Then you can modify the options a bit to get what you want. Here is a list of all the file types that IGV recognizes. There are a lot of them, many are very useful, and the list keeps expanding as the field develops. So usually, whatever file you have, you will be able to open it in IGV.
When you start loading your data you won't see much, because you need to zoom in to actually see your reads. That's why you get the message "Zoom in to see alignments". How far you need to zoom in to start seeing reads really depends on your coverage and the memory you can use. So roughly 30 kb.
But if you want to use a larger window you need more memory, and on some of your laptops that might be too demanding in terms of the memory you need. However, if you have low coverage, so not a lot of reads, you can use a larger window, because you need less memory when you have less data to load. But if you have very deep sequencing, you're going to have to look at a narrow window, because the amount of memory used is high. So there is no particular region size; you just have to zoom in until you can actually see something.
So when we load a BAM file we have alignments. What do we see? We see reads that have been aligned to the genome, and we see some colors. The colors correspond to bases that are mismatches, so they are not the same as the reference genome, and the color depends on the base being A, C, G or T. If the base has a strong color, that means the quality of the base was very good. If it has a light color, that means the quality of the base was not very good. So when you look at a particular SNP or SNV, if the mismatched bases do not have a strong color, you would not really trust it, because those were not good-quality bases.
So what are the important metrics we want to use to evaluate the validity of SNVs? We're going to look at coverage. Of course you need enough reads in a particular region to be confident that there is a mutation. You want to look at the amount of support: how many of my reads have the mutation versus how many of my reads don't? It could be a low variant frequency, but you still need more than two reads, for example, or even more than that. We can check whether there is any strand bias or PCR artifact. And we're going to check that it's a region where the mapping quality is actually good. The mapping quality is shown in IGV by the gray color.
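The support metrics just listed (coverage, number of supporting reads, strand bias) can be summarized for a single position as in the sketch below. It is a hedged illustration, not how IGV or any variant caller computes them: the function name `summarize_site`, the `(base, strand, base_quality)` input layout and the minimum base quality of 20 are all assumptions made for the example.

```python
from collections import Counter


def summarize_site(ref_base, observations, min_base_quality=20):
    """Summarize read support at one position.

    observations is a list of (base, strand, base_quality) tuples, one per
    read covering the position; strand is "+" or "-". Bases below
    min_base_quality (an assumed cutoff) are ignored, mirroring the advice
    not to trust faintly colored mismatches.
    """
    usable = [(base, strand) for base, strand, qual in observations
              if qual >= min_base_quality]
    alt = [(base, strand) for base, strand in usable if base != ref_base]
    alt_strands = Counter(strand for _, strand in alt)
    depth = len(usable)
    return {
        "depth": depth,
        "alt_reads": len(alt),
        "alt_fraction": len(alt) / depth if depth else 0.0,
        "alt_forward": alt_strands["+"],
        "alt_reverse": alt_strands["-"],
        # True when every supporting read comes from a single strand.
        "single_strand_only": bool(alt) and 0 in (alt_strands["+"], alt_strands["-"]),
    }


# A well-supported site: alternate allele T on both strands, good quality.
good = summarize_site(
    "C",
    [("C", "+", 35)] * 3 + [("C", "-", 35)] * 2
    + [("T", "+", 36)] * 3 + [("T", "-", 34)] * 3,
)

# A suspicious site: alternate allele A seen only on the reverse strand.
biased = summarize_site(
    "C",
    [("C", "+", 35)] * 4 + [("C", "-", 35)] * 3 + [("A", "-", 33)] * 4,
)
print(good)
print(biased)
```

A site like `biased` is the kind you would want to inspect by eye with reads colored by strand before trusting the call.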
All reads that have a solid gray color, as on the screen, have good mapping quality, and the reads get more and more whitish, towards white, when the mapping quality is bad. The base quality is what I mentioned before: if a base is a mismatch and the color is not strong, that means the base quality was not good. So that's how those are encoded in IGV. The other important metrics for evaluating a variant, besides the coverage, are the insert size and the read pair orientation. I'm going to describe those two, insert size and read pair orientation, in detail just after.
What does this correspond to? When we want to look at SNPs or SNVs, this is an example of a good SNP. You can see you have a mismatch, a T where the reference allele is a C, as you can see at the bottom. A bit more than 50% of the reads have a T, and they all have a strong red color, so we're confident that it's a real mutation. And if you load the dbSNP track at the bottom, it actually corresponds to a known SNP, because there is an annotation at the bottom. I wish I could point at it.
Here is another example, of a SNP or variant that you would not necessarily trust. In this case, we have a particular position with a C to an A. However, here we color the reads by their orientation. The reads that go towards the left are blue, and the reads going towards the right are red. You can see that the alternative allele is only on reads with one particular orientation. You would want reads going in both directions supporting this mutation, so you would not necessarily trust this one.
What about viewing structural events? Paired reads can provide evidence of genomic structural events such as deletions, translocations and inversions. We can color the read pairs according to the insert size and the pair orientation. So what about the insert size? We have the DNA that you want to sequence. And then it's been fragmented.
That's what Jared described earlier. Then the reads: most of the time you have paired-end reads, so you're going to sequence both ends of each fragment. That's what the black arrows represent; you have sequencing on both ends of your fragment. So the insert size is the distance between those two reads that belong to the same pair. Roughly, all fragments have the same length, so the insert size should be roughly the same for all pairs. When you have your pairs, you align them to your reference genome, and you can compute the distance between the reads; that's the inferred insert size. The inferred insert size can be used to detect structural variants such as deletions, insertions and inter-chromosomal rearrangements.
We're going to take the example of a deletion. What is the effect of a deletion on the inferred insert size? At the top, you have your reference genome. At the bottom, you have the sequence you're actually sequencing, which has a deletion in the middle. So the red part in the middle has been removed, and you now have a junction between the two parts on the right and on the left. If you fragment it and sequence your read pairs, you might have a pair whose reads fall on either side of the junction. If you map this pair back to your reference genome, the reads will map further apart than they are in your real sequence. So the inferred insert size is larger than the expected value. This is an indication we can use to detect a deletion, for example.
And we can visualize it in IGV. You will do that in the practical: we will ask for the alignments to be colored by insert size. A deletion can look like this, for example. You have a drop in coverage; as you can see, there are fewer reads in the middle. And you have some red reads at the borders, at the breakpoints, that have a larger insert size than expected. That tells you there's good evidence for a deletion here.
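The deletion argument above can be worked through with numbers. The sketch below is only an illustration: the function names, the simplified coordinates, and the cutoff of three standard deviations around the expected insert size are assumptions for the example, not IGV's actual thresholds.

```python
def sample_to_reference(sample_pos, deletion_start, deletion_length):
    """Map a coordinate on the sequenced (deleted) genome back to the reference.

    Positions to the right of the junction are shifted by the length of the
    sequence that exists in the reference but is missing from the sample.
    """
    if sample_pos < deletion_start:
        return sample_pos
    return sample_pos + deletion_length


def classify_insert_size(size, expected_mean, expected_sd, n_sd=3.0):
    """Label an inferred insert size relative to the expected distribution.

    n_sd is an assumed cutoff chosen for illustration.
    """
    if size > expected_mean + n_sd * expected_sd:
        return "larger than expected"   # shown in red when coloring by insert size
    if size < expected_mean - n_sd * expected_sd:
        return "smaller than expected"  # shown in blue
    return "expected"


# A 400 bp fragment of the sample spans the junction left by a 1,000 bp deletion.
fragment_start, fragment_end = 900, 1300
deletion_start, deletion_length = 1000, 1000

ref_start = sample_to_reference(fragment_start, deletion_start, deletion_length)
ref_end = sample_to_reference(fragment_end, deletion_start, deletion_length)
inferred = ref_end - ref_start

print(inferred, classify_insert_size(inferred, expected_mean=400, expected_sd=40))
```

In this toy case the real fragment is 400 bp, but the two ends land 1,400 bp apart on the reference, so the pair would be flagged as having a larger-than-expected insert size.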
So, by convention, IGV colors in blue the insert sizes that are smaller than expected, and in red the insert sizes that are larger than expected. And if one read of a pair is mapped on one chromosome and the other read is mapped on another chromosome, IGV colors the reads differently depending on where the mate is mapped. There is one color per chromosome. This is how you will see some rearrangements: the color of these reads indicates that their mates are actually on chromosome six, while the reads themselves are mapped on chromosome one. And vice versa on the right side: you can see the reads mapping on chromosome six, and they are blue, or light blue, because their mates are mapped to chromosome one. So if you have a region like this with colors, it can give you a clue that there is probably a rearrangement. And you can actually click on a read to jump to the position of its mate.
What about read pair orientation? It can help reveal structural variants such as inversions, duplications, translocations and complex rearrangements. The orientation is defined in terms of read strand, left versus right, and read order, first versus second. I will illustrate that with an example. Take the case of an inversion: you have the sequence A to B on your reference genome, and it is inverted in your sequence, so it's B to A. Then you fragment your DNA and sequence some read pairs. Take the example of the first read pair, which spans the first breakpoint. When we map it back to the reference genome, the read on the left maps to the blue part, as in the normal sequence, so that's not a problem. However, the read on the right maps to the reference genome close to the B. And you will have both reads oriented in the same direction, as opposed to facing each other as they should in theory.
If we take the example of a read pair spanning the other breakpoint, the read on the right will map as expected, and the read on the left will map in the opposite direction on the reference genome. So this is what your mapped reads would look like. You expect the orientation to be inward facing, not both reads in the same direction. So this is something you can identify, and IGV colors it in a particular way to highlight it. The left-side pairs are colored in cyan, because both reads go from left to right, and the blue ones are the right-side pairs. So in IGV we can color alignments by pair orientation. An example of an inversion would look like this, with both left-side and right-side pairs at each breakpoint, the cyan and the blue pairs. And you would notice a drop in coverage at the breakpoints as well.
So that's the convention for IGV: you have the normal read pairs, which are in the right orientation. Then you have the pairs in the same orientation, the cyan and the blue. And the last one, the green one, outward facing, indicates a duplication or a translocation. So those are all the different options you would use to highlight the events you detected and check them in IGV. And we're going to actually do it in the practical now.
Any questions before we start the practical? Yes? I assume they have a different color coding. They must have a color coding for every single genome they support. So you need to load the reference genome. I don't know if IGV would load a bacterial genome or whatever genome you want to look at. But if they support the genome, they would have a convention for which chromosome gets which color.
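The pair orientation categories described in this part of the talk can be sketched as a small classifier. This is an illustration under assumptions: the function name `pair_orientation` and the input layout (position and strand for each read of a pair mapped to the same chromosome) are hypothetical, and the two-letter labels follow the left/right naming used when coloring by pair orientation, where a forward-strand read counts as a left read and a reverse-strand read as a right read.

```python
# Interpretation of each orientation, as summarized in the talk.
INTERPRETATION = {
    "LR": "expected, inward facing",
    "LL": "possible inversion (left-side pair)",
    "RR": "possible inversion (right-side pair)",
    "RL": "outward facing, possible duplication or translocation",
}


def pair_orientation(first_pos, first_strand, second_pos, second_strand):
    """Return 'LR', 'LL', 'RR' or 'RL' for a read pair on one chromosome.

    Strands are "+" (read points left to right) or "-" (read points right to
    left). The reads are ordered by mapped position before labelling, so the
    first letter always describes the leftmost read.
    """
    reads = sorted([(first_pos, first_strand), (second_pos, second_strand)])
    return "".join("L" if strand == "+" else "R" for _, strand in reads)


examples = [
    (100, "+", 400, "-"),  # normal pair
    (100, "+", 400, "+"),  # both reads point right
    (400, "-", 100, "-"),  # both reads point left
    (100, "-", 400, "+"),  # reads point away from each other
]
for pair in examples:
    code = pair_orientation(*pair)
    print(pair, code, INTERPRETATION[code])
```

In practice the strand and position of each read and its mate would come from the alignment records, and anything other than `LR` would be a candidate to inspect at the breakpoints.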