So, I'm going to talk about tools to visualize high-throughput sequencing data. This talk is largely inspired by what Dr. Sornar Marisi developed a few years ago, and I just updated it a bit. What we want to learn in this module is to appreciate the different data visualization tools in genomics, to know when to use a particular tool, to get experience with genome browsers, of which there are quite a few, and to become an expert in variant inspection, specifically of single nucleotide and structural variants. So, briefly, we're going to talk about visualization tools, genome browsers, and an IGV overview. And then we're going to look at examples of how to visualize single nucleotide polymorphisms and structural variants.
First, why do we want to visualize our data? It's quite an important thing to do, and you should actually spend time looking at your data. We can take the example of this quite famous data set. It's four different data sets, each composed of X and Y values. If you compute simple metrics, as you would for many data sets, you get the same average value for Y across the four data sets. The variance and the correlation between X and Y are the same as well. And you can even do a linear regression, which gives you pretty much the same answer. However, if you plot them, it's pretty obvious that those four data sets are not the same. The first one has a linear relationship between X and Y. In the second there is still a relationship between X and Y, but it's not linear, as you can observe on the plot: there is actually a curve. The third one has a perfect linear relationship, but with one outlier. So in this case you would do a more robust linear regression and handle the outlier properly. And in the last one there is actually no relationship between X and Y, and just one outlier drives the correlation.
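The four data sets described here match what is usually called Anscombe's quartet; that identification, and the values below (the commonly published ones), are assumptions, since the talk does not list the numbers. As a minimal sketch, the summary statistics can be recomputed to see how close they are across the four sets:

```python
from math import sqrt

# Commonly published values of Anscombe's quartet (assumed to be the data set meant in the talk).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    return sum(values) / len(values)


def summarize(x, y):
    """Return mean of y, Pearson correlation, and least-squares slope and intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_y": my,
        "corr": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": my - slope * mx,
    }


SUMMARIES = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

if __name__ == "__main__":
    for name, stats in SUMMARIES.items():
        print(name, {key: round(value, 2) for key, value in stats.items()})
```

The numbers come out nearly identical for all four sets, which is exactly why the plot, and not the summary table, is what reveals the curve and the outliers.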
So visualizing your data helps you to spot that, and maybe to use a different metric or process the data differently. We have two types of visual processing: pre-attentive and attentive visual processing. The pre-attentive one is what allows you to spot Charlie, actually Waldo, here on the left part of the plot very easily; it's the first thing you notice right away. You would then need your attentive visual processing to actually find him on the beach on the right. So we want to use our pre-attentive brain tool, I would say, to spot outliers. When the data is encoded properly using pre-attentive attributes, outliers can easily be identified visually. As you can see on this small plot, everyone can very easily spot the outlier in every single little graph. So why visualization? We can claim that the human visual system is a low-cost and high-performance tool that serves as a sense-maker and as a debugger, to identify patterns, issues and outliers, as compared to writing code, debugging it and running a script. Sometimes it can be quite hard to have an idea of which outliers to look for if you have never come across them.
There are different genome browsers available out there, over 40 of them, and which one you're going to use really depends on what you want to do. What is the task at hand? Are you using human data, or maybe plant data? Some genome browsers are specific to particular species. What kind of data and what size of data are you dealing with? And if there are issues in terms of privacy of your data, do you have to use it on a protected server in your institute, or can you use a cloud-based one? Those are considerations you need to take into account when you choose which genome browser to use. And you can actually use several of them, as they bring you different information, depending on your project and your question of interest.
For high-throughput sequencing data, I'm naming a few genome browsers here. The Integrative Genomics Viewer is the one I'm going to present in more detail just after, and the one we're going to use today in the practical. UCSC is a famous genome browser. You can load your own data as well and have your own tracks displayed on the UCSC website. Trackster, as part of Galaxy, allows you to perform visual analytics on a small window of a genome. It's part of the big Galaxy project, a set of tools which allows you to perform analyses without writing a script by hand, but rather by choosing options for every single tool. So that could be very interesting for someone who doesn't want to code so much but still wants to analyze their own data. And I mention as well the Savant genome browser, which has the particularity of allowing you to visualize your data but also to do some analysis. So when you have a variant, you can run something as part of Savant to check whether this variant has already been found in another cohort, or to try to predict the effect of this variant, which IGV wouldn't allow you to do. You would use other tools to do that.
So what about the Integrative Genomics Viewer? It's from the Broad Institute, and I just want to mention that the slides with the Broad logo at the very top, which is actually cut off on the screen, are largely reused. I borrowed them from the IGV tutorial from the Broad, so I'm reusing the materials they provide. IGV is a high-performance visualization tool for interactive exploration of large data sets. It supports array-based and sequencing-based data as well as genomic annotations. It's a desktop application for interactive visual exploration of integrated genomic data sets. You can load many different types of data in IGV, and these are examples of a few, such as epigenomic tracks, histone marks for example, or microarray data, of which there is still a lot out there.
A lot of it is public data that you can actually download and display like this. We're going to look at some alignments of sequencing reads. You can display your RNA-seq alignments as well, to look at the expression of a particular gene, or copy number calls when you've done your copy number analysis, for example. With IGV you can explore large genomic data sets with an intuitive and easy-to-use interface. You can integrate multiple types of data, as well as clinical or other sample information, whatever you know about your different samples. And you can view data from multiple sources: local, remote or cloud-based. It depends on what you want to do and on your project.
There is as well the possibility of using a command line interface and writing an automated script to ask IGV to load your data and go to a particular location, do something, take a screenshot of this location, and then go to the next location and do it again. So you can do that automatically for hundreds of positions. That can be a very good way of reviewing the particular SNP calls that you have. In some projects you would run different SNP callers, which are going to agree on many positions but disagree on many others, and you would want to look at those to figure out: is it a true call from this caller, or is it not a true call that I should not consider? So that can be a good way of reviewing a lot of screenshots efficiently. This is not in the tutorial, but I can provide you with information on that if you're interested.
So, the sources and types of data that can be used when you load data in IGV. You can get data from a server, from the Amazon cloud, or data that you store locally on your computer. You can use public data sets as well, such as TCGA. You can view local files without uploading them. You can view remote files without downloading the whole data set.
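The screenshot workflow mentioned above can be sketched with IGV's batch commands (`new`, `genome`, `load`, `snapshotDirectory`, `goto`, `snapshot`, `exit`). The helper below is only an illustration: the function name, the file names, the genome ID and the 50 bp flank are assumptions, not something taken from the talk or the tutorial.

```python
def build_igv_batch(bam_path, genome, snapshot_dir, positions, flank=50):
    """Return the text of an IGV batch script that takes one snapshot per position.

    positions is a list of (chromosome, 1-based position) tuples.
    flank is the number of bases shown on each side (an arbitrary choice).
    """
    lines = [
        "new",                                # start a fresh session
        f"genome {genome}",                   # reference genome ID, e.g. hg19
        f"load {bam_path}",                   # alignment file to display
        f"snapshotDirectory {snapshot_dir}",  # where the PNG files are written
    ]
    for chrom, pos in positions:
        start = max(1, pos - flank)
        lines.append(f"goto {chrom}:{start}-{pos + flank}")
        lines.append(f"snapshot {chrom}_{pos}.png")
    lines.append("exit")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical inputs: two candidate SNP positions to review.
    candidates = [("chr1", 1_000_000), ("chr2", 2_500_000)]
    script = build_igv_batch("sample.bam", "hg19", "snapshots", candidates)
    with open("review_snps.batch", "w") as handle:
        handle.write(script)
```

The resulting text file can then be run from IGV through Tools > Run Batch Script, and the images reviewed one after the other.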
So that makes it easier, as you do not need a huge amount of memory on your laptop but can still view large data sets in this way. So, the basics of IGV: you launch IGV, you select your reference genome, you load the data, and then you navigate through the data, going to a particular location. For whole-genome sequencing data you would specifically look at SNVs and structural variants, which we're going to do in the tutorial just after. Going to the website, you probably did that already because you're supposed to have installed IGV: you go to the download page, download it and install it on your computer.
So the IGV screen looks like this. The first thing you need to do is to select the genome of interest with the drop-down menu, and then you load your data. To load your data you go to File, then Load. You can load from the server or from your own file. This is an example of how to use public data which is already available through IGV, but we will be loading our small BAM in the tutorial. When you have loaded your data, this is how it might look, though it will look different depending on the data type you're loading.
Just to give you an overview of the different panels. At the top, of course, you have the menu. Then the toolbar, which is an important feature we're going to go through, especially this little yellow sticky thing. Here, this one. That allows you to deactivate the pop-up information that displays every time you move your mouse over something; usually you want it to show up only on click. So that's one of the first things we do when we open IGV for the first time. If you've never used IGV, you will experience it in the lab, in the tutorial just after that. And here we have the genome ruler. This is the human genome, chromosomes 1 to 22, X and Y.
And you can go to a particular location by putting your location in this box, or by typing the name of the gene of interest in this box, and you go directly to the right location. This is the track name, with the sample name or the file you're loading. Then, if you loaded attributes, each value has a color code for the different information on the sample. It can be the sex of the patient the sample comes from, the subgroup or the particular type of sample you sequenced, or whatever clinical or other information you have on your data. Then come the tracks, with your reads or your enrichment scores, and the genome features at the bottom, where, when you zoom in, you will see the genes displayed as you would in a normal genome browser. And you can have many more types of information, like the GC content of the genome or the repeats that are annotated on the genome, and all the different tracks you can load as well.
So what type of file format can you actually load? Many of them. The file format defines the track type, and the track type determines the display options. So IGV is going to recognize which file format you're loading and will display your data in a particular way depending on what it is. If it's a BAM, it already knows: I'm looking for reads, I'm looking for the quality of the read, I'm looking for the quality of each particular base in the read, and I'm going to display it in a particular way. The full list of the different data types supported is at the link at the bottom.
So when we have whole-genome data, we want to look at the alignments. If you look at the whole chromosome at the beginning, you won't see anything, because your window is way too large and IGV cannot load everything. So it will tell you to zoom in to see the alignments. So how much do you need to zoom in to actually see something?
Well, it really depends on your data, but roughly 30 kb is a region which is okay to start being able to visualize your reads, for example. It really depends on how deep your coverage is. The larger the window, the more memory IGV is going to need to be able to display it. And if you have very deep coverage, then it needs even more memory. So depending on how deep your coverage is and how much memory you have on your laptop, you will need to go to smaller windows to be able to see your reads.
So when we zoom in, we can start looking at the alignments. If the bases inside the read match the reference, they are not colored. If they don't match the reference, they are colored: blue, red, etc. So if you load your BAM file and everything looks like a rainbow, it's probably that you didn't load the right reference genome. If it's mainly gray with a few colors, that makes more sense. So we're going to look at those bases that are colored. Those are the ones that are of interest, specifically for SNPs.
So what are the metrics that can be used to evaluate the validity of SNPs? The first one is the coverage, then the amount of support. Is there any bias that you can detect, a strand bias or a PCR artifact? What is the mapping quality, and what is the base quality? All these things can be viewed or highlighted with different IGV options. For structural variants, we're going to look at coverage, insert size and read pair orientation, which I'm going to explain in detail just after.
So if we want to review a SNP or an SNV, here we have an example of a particular SNP. We center our IGV screen on this particular location, which is indicated by the dotted black lines. So that's where you're really interested. You can see here that you have some red Ts displayed on the screen.
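To get a feeling for why the window size and the coverage both matter for memory, here is a back-of-the-envelope sketch. It only estimates how many reads overlap a window; the 30x coverage and 100 bp read length are illustrative assumptions, and the function is hypothetical, not how IGV actually budgets memory.

```python
def reads_in_window(window_bp, mean_coverage, read_length_bp):
    """Rough number of reads needed to cover a window at a given mean depth.

    Coverage is (number of reads * read length) / window size, so the
    number of reads is window size * coverage / read length.
    """
    return round(window_bp * mean_coverage / read_length_bp)


if __name__ == "__main__":
    # A 30 kb window at an assumed 30x coverage with 100 bp reads.
    print(reads_in_window(30_000, 30, 100))
    # The same coverage over a whole 1 Mb window needs far more reads in memory.
    print(reads_in_window(1_000_000, 30, 100))
```

Doubling either the window or the coverage doubles the number of reads to hold, which is why deep data sets call for smaller windows.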
So all the reads that are still gray match the reference genome, which is a C. All the reads that show the T letter have an alternative base, which is a T here. And you can see at the top, this is the coverage track here. That is roughly a heterozygous call, with about 50% of the reference and 50% of the alternate, actually a bit more of the T here. You can notice as well that some Ts have a darker red and some are lighter. This is what encodes the base quality. So if you have a strong T letter, or A or C letter, you would trust it. If it's not as strong, and blander or faded, I'm not sure how to call it, you would say: oh, actually I would not trust it as much as this one, for example.
So for example, this was a SNP called by a caller. You can color the reads in a particular way as well. Here we color all the reads that are on the forward strand in red and the reads on the reverse strand in blue. And you can see that for this particular SNP, only the reads that are going in this direction have the alternative allele, and none of the red ones have it. That would be an indication that there is a strand bias, and that this mutation or variant may not be a true one, but could be a side effect of the sequencing. And you observe as well that most of them are at the end of the read. As I described earlier, the ends of the reads are more prone to error, and that is one of the patterns you can see here. So you would not trust this being a C here.
So we want to view structural events as well. And now we have paired reads, paired-end reads. These can yield evidence for genomic structural events: deletions, translocations and inversions. So we want to color the pairs in a particular way to help us identify these structural variants. For this, we can look at the inferred insert size, which I'm going to describe just after, and the pair orientation.
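Coming back to the SNV review checks described above (coverage, allele support, base quality, strand bias), they can be sketched as a small tally over the reads covering one position. Everything here is illustrative: the `Obs` record layout, the quality cutoffs of 20 and the "all alternate reads on one strand" rule are assumptions, not IGV's or any caller's actual logic.

```python
from collections import namedtuple

# One read's observation at the position of interest (hypothetical record layout).
Obs = namedtuple("Obs", "base is_reverse base_quality mapping_quality")


def summarize_site(observations, ref_base, alt_base,
                   min_base_quality=20, min_mapping_quality=20):
    """Tally reference and alternate support, ignoring low-quality observations."""
    kept = [o for o in observations
            if o.base_quality >= min_base_quality
            and o.mapping_quality >= min_mapping_quality]
    ref = [o for o in kept if o.base == ref_base]
    alt = [o for o in kept if o.base == alt_base]
    return {
        "coverage": len(kept),
        "ref_count": len(ref),
        "alt_count": len(alt),
        "alt_fraction": len(alt) / len(kept) if kept else 0.0,
        "alt_forward": sum(1 for o in alt if not o.is_reverse),
        "alt_reverse": sum(1 for o in alt if o.is_reverse),
    }


def looks_strand_biased(summary, min_alt=4):
    """Crude rule of thumb: enough alternate reads, all from a single strand."""
    one_sided = summary["alt_forward"] == 0 or summary["alt_reverse"] == 0
    return summary["alt_count"] >= min_alt and one_sided


if __name__ == "__main__":
    # Made-up pileup: reference C on both strands, alternate T only on reverse reads.
    pileup = ([Obs("C", False, 35, 60)] * 5 + [Obs("C", True, 35, 60)] * 5
              + [Obs("T", True, 30, 60)] * 6)
    site = summarize_site(pileup, ref_base="C", alt_base="T")
    print(site, looks_strand_biased(site))
```

A site flagged this way is a candidate for exactly the kind of visual review shown in the talk, not an automatic rejection.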
So what you expect is this inward-facing pair orientation, but sometimes you have other ones, and that's going to help you to identify structural events. So, paired-end sequencing; you have probably seen that before. You have your DNA, which you cut into fragments and sequence. You get reads from both ends, and the insert size between these reads is roughly the same for your whole library; it tends to be something like 300 or 400 base pairs. So when you map your paired reads to your reference genome, you are able to infer the distance between the reads of a pair. Most of the pairs in your library will have roughly the same distance, but some will be a lot further apart or a lot closer together, and they allow you to detect structural variants. You're going to go through the detection of these variants in detail tomorrow: which tools to use and how to detect them. But we can start visualizing them and see what the evidence is when we look at the pairs.
So we can interpret the inferred insert size, and we can detect deletions and insertions, as well as interchromosomal rearrangements. Basically, if you have a rearrangement between two chromosomes, the insert size cannot be computed, because the two reads are on two different chromosomes. So it's going to be undefined, and that can be evidence that this part of the genome has been linked to that part of the genome, when you have a read pair between chromosomes 1 and 12, for example. So that can be a clue for interchromosomal rearrangements. The other two, deletions and insertions, you will be able to spot with a larger or shorter insert size than expected. So we're going to see the example of the deletion. What is the effect of a deletion on the inferred insert size?
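The reasoning above can be written as a small classifier for one read pair. This is only a sketch: the function name, the tuple layout and the 200 to 500 bp "expected" range are assumptions chosen for illustration; in practice the expected range would come from the library's own insert size distribution.

```python
def classify_pair(read1, read2, expected_min=200, expected_max=500):
    """Classify a read pair from its inferred insert size.

    Each read is a (chromosome, start, end) tuple in reference coordinates.
    Returns (category, inferred_size); the size is None when the two reads
    map to different chromosomes, because no distance can be computed.
    """
    chrom1, start1, end1 = read1
    chrom2, start2, end2 = read2
    if chrom1 != chrom2:
        return "interchromosomal", None
    # Distance from the leftmost start to the rightmost end of the pair.
    size = max(end1, end2) - min(start1, start2)
    if size > expected_max:
        return "larger_than_expected", size    # possible deletion
    if size < expected_min:
        return "smaller_than_expected", size   # possible insertion
    return "expected", size


if __name__ == "__main__":
    print(classify_pair(("chr1", 1000, 1100), ("chr1", 1250, 1350)))
    print(classify_pair(("chr1", 1000, 1100), ("chr1", 3250, 3350)))
    print(classify_pair(("chr1", 1000, 1100), ("chr12", 5000, 5100)))
```

A single unusual pair means little on its own; it is a cluster of pairs in the same category at the same place that points to a structural event.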
So here's your reference genome. And in the sample you sequenced, there is a deletion of the red part in the middle. So the sample would look like this. When you cut your DNA and do your paired-end sequencing, you will have reads mapping on your sample, at the bottom, on either side of the junction. But when you map those reads back to the reference genome, they are going to be a lot further apart. So the inferred insert size is going to be larger than expected, larger than for most of the read pairs in your library. And we can color that in IGV to be able to identify it more easily. There is a menu where you can color by insert size. We will do that in the practical. And a deletion could look like this. You would have read pairs at the edges of the deletion, and you could see a drop in coverage at the breakpoints. So pairs with a larger than expected inferred insert size are colored red. IGV has a color code for this: larger is red, shorter is blue. And if the reads of a pair map to different chromosomes, they have different colors, as in the example here.
So, for rearrangements, a rearrangement might look like this. You would have some pairs at a particular position, for example in your tumor and nothing in your normal, that have this brownish color, which tells you that the mate of this pair is mapping to chromosome 6. And at the corresponding location on chromosome 6, you would see a blue color that tells you that the other mate is mapping to chromosome 1. So that would be evidence for a rearrangement.
How can we interpret read pair orientation to detect or visually validate inversions, duplications, translocations and complex rearrangements? For the read pair orientation, which I'm going to display just after, we talk about the read strand, left versus right, and the read order, first versus second.
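The effect of a deletion on the inferred insert size can be worked through with a toy coordinate conversion. The positions and sizes below are made up for illustration, and the two helper functions are hypothetical; they only model a single deletion on one chromosome.

```python
def sample_to_reference(pos, del_start, del_length):
    """Convert a sample coordinate to a reference coordinate.

    The sample lacks del_length bases that start at reference position
    del_start, so every sample position at or after the junction sits
    del_length bases further along the reference.
    """
    return pos if pos < del_start else pos + del_length


def inferred_insert_size(frag_start, frag_end, del_start, del_length):
    """Insert size seen on the reference for a fragment given in sample coordinates."""
    ref_start = sample_to_reference(frag_start, del_start, del_length)
    ref_end = sample_to_reference(frag_end, del_start, del_length)
    return ref_end - ref_start


if __name__ == "__main__":
    # A 350 bp fragment that spans the junction of an assumed 2,000 bp deletion.
    print(inferred_insert_size(900, 1250, del_start=1000, del_length=2000))
    # A 350 bp fragment entirely upstream of the deletion keeps its real size.
    print(inferred_insert_size(500, 850, del_start=1000, del_length=2000))
```

Only the fragments that span the junction get the inflated size, which is why the red pairs pile up right at the edges of the deletion.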
So we will display the read pair orientation in different cases, which is going to allow us to detect inversions, for example. So we have the reference genome. There is a sequence, A to B, that is inverted in the sample you sequenced, which will look like this. You cut your DNA and do the paired-end sequencing, and some pairs will span the junction at the B position and some will span the junction at the A position. Take the first one, at the B location. When you then map your reads to your reference genome, the read that maps to the blue part will map properly here. However, the read that maps to the part that has been inverted will not map here, but rather here. And you can observe that the pair orientation is no longer like this, but like this: both reads going in the same direction. For the pairs that span the other junction, at the A position, this would be what has been sequenced from your sample. And when you map it, you have a read in the blue part on the right, which maps without any problem, as you expect. And the second read of the pair actually goes to the A here. So you see a pair orientation that is not what you expect: instead of the inward-facing one, you have the left-side pairs and the right-side pairs, which can be colored in a particular way. In IGV they would be cyan and blue. So we can go to IGV and color by pair orientation, as you can see here. And an inversion would look like this. At each breakpoint, you would have some cyan and blue pairs. And you can notice a drop in coverage at the breakpoints as well.
So in RNA sequencing analysis, to handle splice junctions, some reads are split, on the assumption that there must be a gap in the alignment. So could you do a similar thing here? Take the unaligned reads, presume a splicing event in between, and just split them in two?
Yes, so the way IGV is going to display it is this: you load your BAM file, where the alignment has been done by an aligner that is aware of the splicing, and then it's going to be displayed. You're going to have a gray box, then a line, then another gray box. That means your read has been mapped at these two positions, but you know it's the same read. So that's how IGV is going to display it. But IGV is just a visualization system. It's not going to be able to map the read itself. So you have to do it properly beforehand, producing your BAM file with an aligner that is aware of splicing. But yes, IGV is able to display it properly, as you would expect for the exons. Yes. So each read or paired read... You will see this this afternoon in the practical when we look at... Yeah.
So these are the different categories of read pair orientation, with the normal one at the top. Then you have LL and RR, which are evidence for inversions, which are like left-left or right-right. And then you have RL, which would be evidence for a duplication or a translocation. Both of them can have this pattern of outward-facing reads. And that's it for the lecture part.
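The orientation categories listed above can be summarized in a few lines of code. This sketch assumes that "L" stands for a read aligned on the forward strand and "R" for a read on the reverse strand, which is my understanding of how IGV builds these labels for standard paired-end libraries; the function names are hypothetical, and the interpretations are the ones given in the talk.

```python
# Interpretations as given in the talk, for reads ordered by position on one chromosome.
INTERPRETATION = {
    "LR": "normal, inward-facing pair",
    "LL": "evidence for an inversion",
    "RR": "evidence for an inversion",
    "RL": "outward-facing: evidence for a duplication or a translocation",
}


def pair_orientation(left_is_reverse, right_is_reverse):
    """Return LR, LL, RR or RL for the leftmost and rightmost read of a pair.

    Assumption: 'L' marks a forward-strand read and 'R' a reverse-strand read.
    """
    left = "R" if left_is_reverse else "L"
    right = "R" if right_is_reverse else "L"
    return left + right


def interpret_pair(left_is_reverse, right_is_reverse):
    """Return the orientation label together with its reading from the talk."""
    label = pair_orientation(left_is_reverse, right_is_reverse)
    return label, INTERPRETATION[label]


if __name__ == "__main__":
    for strands in [(False, True), (False, False), (True, True), (True, False)]:
        print(interpret_pair(*strands))
```

As with insert size, one odd pair is not a call; it is a group of LL, RR or RL pairs stacked at the same breakpoint that is worth a closer look.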