Today, we'll talk about single nucleotide variant calling, and I will try to teach you how to do that: to understand the main principle when we do it and how we can improve the data to get better results, so it will be a reminder of what we did yesterday plus some add-ons, plus how to filter the data. And at the end, a bit of SNV visualization in IGV.

Before we start, I will do a shameless advertisement for what we are doing, especially if you are Canadian. We have a strong partnership with Compute Canada, and if you are Canadian or have Canadian collaborators, I really strongly encourage you to go to Compute Canada. It's a free set of high-performance computing resources available for any academic researcher and their personnel. There are more than 2,000 CPU cores available and a lot of storage, and one or two new HPC centres have opened this year, so it's a really big chance we have in Canada to have this resource for free. If you want to apply, you just go to the website, the PI applies, and then you choose the consortium you want to work with: it could be on the West Coast, it could be one of the Ontario consortia, it could be Mammouth or another Calcul Québec site if you are in Quebec, and then you choose which server you want to work on. If you just register, you get the default allocation, which is, I think, 2 terabytes of storage, and I don't remember exactly how many core-years of usage. You can go over what you asked for, but if you go over your default allocation you will have lower priority on the system. If you want a large allocation, you need to submit every year a kind of small grant, two or three pages, with a deadline usually in November, where you do the math: I have this amount of data, it will take this amount of space and this amount of processing, and you ask for that.

What is cool is that we have a partnership with them, and at C3G we maintain a set of resources for Compute Canada. The idea was to build this on a system called CVMFS, a virtual-machine file system: we maintain, in one location, all the tools, all the genomic resources and all the pipelines for bioinformatics, and they are distributed across the different clusters. So whichever cluster you use, you know the tools have been compiled in exactly the same way as on any other cluster. We have a lot of tools and a lot of genomic resources, for 14 different species, and we also provide our own pipelines, so what we are doing today you can do directly in an automated way. All of that is integrated into a platform we have called GenAP, which manages all these kinds of things: the CVMFS, the pipelines, but also a private instance of Galaxy. At the moment that Galaxy runs only on Calcul Québec, so you can have your data on the cluster and run Galaxy directly; the Galaxy jobs will run on the Compute Canada cluster, so your data won't end up on the official Galaxy website (hosted at the University of Pennsylvania, I think, or I don't remember which). It's a really interesting system, and you also have a data hub to share your data or your tracks with your colleagues.
So I really encourage you, if you are Canadian or if you have Canadian colleagues, to apply to this kind of consortium.

Okay, now let's talk about SNVs. People do genome resequencing mainly to map genetic variation and structural variation, so to find variants in individuals. This is the main purpose of this kind of analysis, and it is done mainly for rare disease, but also in agriculture and that kind of project, and there are also large projects, like the 1000 Genomes Project, that do it just to get a global picture of variation in the human population. To do that, this is the workflow we have been working on since yesterday. Yesterday we saw all the data processing, and today we'll see the second part. If we summarize the steps: quality control (when you receive your data, it is really important to check its quality), then you preprocess your data to get the highest quality out of it, you map your data, you process your mapping to improve the alignments, and then you do the small variant calling.

As I said, quality control matters. Before you start your analysis, you need to look at your data. What is really important, when you do an experiment or when you want to compare experiments, is to use the same protocol and the same instrument for all the different samples. Quite often people come who did a project a few years ago, they want to do another project, they don't use the same protocol and instrument, and for sure we see artefacts in one group or the other, and that can be an issue. If we don't normalize everything, we will clearly see technical effects. So it's really important, especially for SNVs, and especially if you want to compare conditions, like case versus control.

So what is single nucleotide variant calling? It can be summarized like this: we have the reference genome, and we try to find specific positions where we observe another allele than the one in the reference genome. It's as simple as that. What is a little bit trickier is that you want to find this type of position without also calling the other type of position, the sequencing errors. As I said a little bit yesterday, what is really important for good SNP calling is to have good base quality, so you can trust your bases and reduce the number of possible errors, and, as you can see here, to have enough coverage. Here we have around 10x of coverage, and it's clearly easy to see what is a SNP and what is not. Now imagine I only have 3x of coverage, which is what was done for many samples in the 1000 Genomes Project: suddenly I cannot tell the difference between my SNPs and my errors. So coverage and base quality are really important. Base quality we talked about yesterday: the idea is that if you see two regions that both show an alternate allele, one supported by high-quality bases and one by low-quality bases, the caller will tend to favor calling the variant in the first one rather than the other. So how does it work?
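To make the scale concrete (these are just the standard Phred definitions, nothing specific to any particular caller), a base quality Q corresponds to an error probability of

$$ p_{\text{error}} = 10^{-Q/10}, \qquad Q=20 \Rightarrow 1\%, \quad Q=30 \Rightarrow 0.1\% $$

and at 3x coverage a true heterozygote shows only reference bases in about $(1/2)^3 = 12.5\%$ of positions, which is one reason low coverage makes it hard to separate real variants from sequencing errors.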
So this is the general workflow for SNP calling: you get your data from the sequencer, you do base calling, you get your FASTQ, you do the read mapping, you improve your mapping, and then you have two ways of doing your variant calling: multi-sample calling and single-sample calling. The two methods are a bit similar, but I really encourage you to go with multi-sample calling if you have multiple samples, because it gives you an advantage.

The principle of multi-sample calling is to do the work in two steps. In the first step, you take the data from everybody and merge everything together, just to find the regions that are candidates for a SNP. The more information you have, the more depth you create, and the more variation you will be able to detect correctly against sequencing errors, because sequencing errors will be random between samples. The idea is that, with enough samples, a variant present at a given frequency in the population will accumulate evidence in the form of variant reads across samples. So the first step finds these candidate regions and provides posterior probabilities for each sample, and then you do a maximum-likelihood estimation for each sample to generate the genotypes. With single-sample calling, the idea is to do everything at the same time: find every site and call the genotype in one pass.

Most methods use something close to a Bayesian approach, but if you have very deep coverage (I think more than 100x for some callers), some callers, to be faster, will just use a threshold approach: if between 20% and 80% of my reads carry the variant, I call it heterozygous; if more than 80%, homozygous variant; and if less than 20%, homozygous reference. Really fast, really easy, but it only works if you have enough coverage. Some t-test approaches are also used, but more in paired mode than in a single-sample mode: when you have one sample and you compare tumor versus normal, you do a t-test to see whether, at a specific position, you see statistically more variant reads in the tumor than in the normal. That is how you try to find somatic variation.

Just to give you a quick overview of the Bayesian approach, I will not go into detail. I try to reduce the number of equations in my slides as much as I can, because people get scared or bored. The idea is simply that you want to find the genotype given the data, which means finding, for each sample, the most probable genotype based on what you observe at each read and on the base quality. That's why base quality is really important when you do SNP calling. So it's just a summary and accumulation of probabilities. I don't go into detail because every caller has its own methodology and its own formula for variant calling, so I could show you one, but another caller uses a different one; it wouldn't be that useful for you. So, what are the strategies we know to improve this variant calling?
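As a hedged sketch of the kind of calculation most Bayesian callers do at a single position (this is a generic illustration, not the exact model of any specific caller; all names and the priors are illustrative), the code below turns base qualities into per-genotype likelihoods and picks the most probable genotype, and also shows the simple threshold rule for comparison:

```python
# Minimal sketch of Bayesian genotype calling at one position, plus the
# simple high-coverage threshold rule. Illustrative only, not a real caller.

PRIORS = {"0/0": 0.998, "0/1": 0.001, "1/1": 0.001}   # assumed genotype priors

def phred_to_error(q):
    """Convert a Phred base quality into a probability of sequencing error."""
    return 10 ** (-q / 10.0)

def base_likelihood(base, quality, allele):
    """P(observed base | true allele), spreading errors over the 3 other bases."""
    e = phred_to_error(quality)
    return 1.0 - e if base == allele else e / 3.0

def genotype_likelihoods(pileup, ref, alt):
    """P(data | genotype) for 0/0, 0/1, 1/1; each read is assumed to come
    from either chromosome of the diploid genotype with probability 1/2."""
    genotypes = {"0/0": (ref, ref), "0/1": (ref, alt), "1/1": (alt, alt)}
    liks = {}
    for name, (a1, a2) in genotypes.items():
        lik = 1.0
        for base, q in pileup:
            lik *= 0.5 * base_likelihood(base, q, a1) + 0.5 * base_likelihood(base, q, a2)
        liks[name] = lik
    return liks

def call_genotype(pileup, ref, alt):
    """Pick the genotype with the highest posterior (prior x likelihood)."""
    liks = genotype_likelihoods(pileup, ref, alt)
    posteriors = {g: PRIORS[g] * liks[g] for g in liks}
    return max(posteriors, key=posteriors.get)

def threshold_call(alt_fraction):
    """The simple rule: <20% homozygous reference, 20-80% het, >80% homozygous alt."""
    if alt_fraction < 0.2:
        return "0/0"
    if alt_fraction <= 0.8:
        return "0/1"
    return "1/1"

# Example: 10x coverage, 3 high-quality alternate reads out of 10.
pileup = [("A", 30)] * 7 + [("G", 30)] * 3
print(call_genotype(pileup, ref="A", alt="G"))   # the Bayesian model calls 0/1
print(threshold_call(3 / 10))                    # the threshold rule also says 0/1
```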
Some of them we saw yesterday: local realignment, duplicate marking, base quality recalibration, and using population or family structure.

Just a reminder on local realignment: what we want to avoid is the situation where some reads carry the indel placed correctly and others have the indel placed differently, which shows up as a few mismatches around the region. If you don't do indel realignment, you will probably call an indel here, a variant here and a variant here. If you do your indel realignment, all the reads will carry the indel consistently and all the fake variants will disappear.

Duplicate marking: why is it important to mark duplicates? Because if you have this read here with a variant, imagine this variant comes from a PCR error at the first cycle and you have all these PCR duplicates: you end up with around six copies out of eight that show the variant, and you can imagine your variant caller will probably call a variant here, which will be a false positive. If you mark all those reads as copies of a single fragment, you will have only one variant read out of three, and in that case you probably won't call a variant at this position.

Base quality, as I said, is really important in the Bayesian model, so it's really important to recalibrate it so you don't have bias. You need to correct for the position of the bases along the read and for the sequence context; otherwise you will start to see positional effects and context effects in your variants.

What I didn't talk about yesterday is how we can improve calls using family or population structure and imputation. It's a bit related to what I said about multi-sample calling. The idea is to use haplotypes: if you know that in the population you have these two haplotypes, and now you are sequencing and you get this base N that you cannot trust because it has bad quality, then based on the haplotype information you can probably guess what the value of N should be. So what would be the value of N? Be careful, there's a trap in the color code: it's not G, it's T.

As I said, multi-sample calling is really a good way to improve your variant calling. This is an old slide, and the tool shown is now deprecated, but when you compare Beagle plus a multi-sample approach, GATK in multi-sample mode (UnifiedGenotyper multi-sample) and UnifiedGenotyper in single-sample mode, you can see on this curve (it's not exactly a ROC curve, but close) that the results are quite a bit better with the multi-sample approach.

Another way to improve your data is to use trios. If you have a family and you know the family structure, you can use it to get more information, because you expect each allele of the child to come from one of the parents, so you have extra evidence and you can use it to improve your calls. You can also use the Mendelian segregation of alleles: if you see a variant in the child and you don't see the variant in either parent, it's probably not a true variant — probably, because we know there are de novo mutations that arise only in the child.
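Here is a small sketch of the kind of Mendelian check just described, using made-up VCF-style genotype tuples (purely illustrative, not any pipeline's actual implementation):

```python
# Check whether a child's genotype is consistent with Mendelian segregation
# given the parents' genotypes, ignoring de novo mutations.
from itertools import product

def mendelian_consistent(child, mother, father):
    """True if the child's genotype can be formed from one maternal and
    one paternal allele. Genotypes are allele pairs, e.g. (0, 1) = het."""
    possible = {tuple(sorted((m, f))) for m, f in product(mother, father)}
    return tuple(sorted(child)) in possible

# A het call in the child with two homozygous-reference parents is either a
# genotyping error or a (rare) de novo mutation.
print(mendelian_consistent(child=(0, 1), mother=(0, 0), father=(0, 0)))  # False
print(mendelian_consistent(child=(0, 1), mother=(0, 1), father=(0, 0)))  # True
```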
That is also why you can compute the de novo mutation rate: if you see that the de novo mutation rate is really high, you probably have false positives in your data.

When we do variant calling, we start from the BAM file. Usually variant calling is not a really big task, it's around 10 hours with the callers we use, and it will reduce your data a lot: you will end up with around a gigabyte of data for your variants for the whole analysis.

Now you've got your variants. What is interesting is to understand the format of the variants and how we can filter and annotate them. The format is the VCF format, the Variant Call Format. It's based on the same principles as the BAM file, in the sense that you've got a big header that gives you a lot of information, and then you've got your calls. In the header there are two mandatory lines: the format version of your data (because the VCF format has evolved over time, it's important that a tool knows which version you are using) and the names of the columns used for your data. All the other header lines are not mandatory, but almost every program fills them in. What do those lines correspond to? They record the commands that were used to generate your data, but they also describe the fields, because in the records you've got your chromosome, position, ID, reference allele and alternative alleles, then the quality, then, if you have tried to filter your data, a column that tells you whether the variant passed or not, and then some information about the variant in general (INFO) and the format of the per-sample genotype fields (FORMAT). These last two are described in the header. For example, if I take DP, DP is the total depth of coverage at this position; all that information is encoded there.

If I go to the FORMAT, GT is the genotype. The genotype is encoded with a pipe or a slash: 0/0 for homozygous reference, 0/1 for heterozygous and 1/1 for homozygous variant. You will say, yes, but I sometimes see 1/2 or 2/2; what does that mean? Sometimes in the genome you are at a multi-allelic position: here there are two possible alternate alleles, so you have the reference and the two alternates, numbered 0, 1 and 2 (and it could be more if you have three or whatever). So the GT for a variant at this position follows the same principle as 0/1, but with multiple alternate alleles each new allele gets the next number, 2, 3 and so on.

Now you have generated this data and you know what's in it. What is good to do next is filtering, because the raw variant calls usually contain a lot of errors; most of the algorithms tend to be rather permissive in the variant calling. There are two ways to do your initial variant filtering. You can do it manually, using GATK VariantFiltration or SnpSift, and that's what we will do today, because we won't be able to use the second approach, variant recalibration, for which we would need more samples. The first method is to do the variant filtering manually: usually you know which parameters you want to filter on, you know which parameters influence your false positive and false negative rates, so you apply them and say, for example, I want to remove every variant that has a depth of coverage lower than 10, because I don't trust calls at that depth of coverage.
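To make the column layout concrete, here is a tiny sketch with a made-up VCF record (the values, the sample name and the rsID are invented purely for illustration):

```python
# How the fixed VCF columns and the per-sample FORMAT/genotype fields fit together.
header = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "sample1"]
record = "chr1\t12345\trs123\tA\tG\t50\tPASS\tDP=32\tGT:DP\t0/1:30"

fields = dict(zip(header, record.split("\t")))
info = dict(kv.split("=") for kv in fields["INFO"].split(";") if "=" in kv)
sample = dict(zip(fields["FORMAT"].split(":"), fields["sample1"].split(":")))

print(info["DP"])     # total depth at the site, as described in the header
print(sample["GT"])   # 0/1 -> heterozygous: one reference allele, one alternate
```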
So you will use the DP field and say, okay, filter everything below that. This is done by VariantFiltration; it's really efficient, but it's difficult because it requires you to know exactly how the variant calling works and which parameters you need to play with. If you have a large cohort, I really recommend going with the second approach, variant recalibration, which is similar to base recalibration. The idea is to learn, from a set of known variants, what the good parameters for calling a variant look like, and then to apply that to the rest.

You can also use annotations, HapMap and dbSNP, to annotate your calls and to look for false negatives and false positives, and that is part of variant recalibration. The idea of variant recalibration, which is really interesting and really efficient, is this: you take a training set from your own calls. It's a machine-learning approach: in this training set you look at which calls are found in well-known databases like dbSNP or HapMap, so variants that are really known and confirmed, and which ones are not and seem to be false. From that, you generate a mathematical model of what, in your data, corresponds to a good variant call. Once you have this model, you apply it to all the variants in your data, and then you choose a sensitivity parameter, for example 99%: you want that, when you apply this rule to all your data, 99% of the good, confirmed variants found in HapMap are retained. It's just a way to tune your final call set, and it works really well to improve your variants, but you need enough data: there's no magic number, but I think under 10 to 15 samples it's hard, and sometimes it will just crash. That's why you need a large number of variants to do that.
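For the manual approach, here is a hedged sketch of hard filtering on depth: it marks records whose INFO DP falls below a chosen threshold, in the spirit of what GATK VariantFiltration does (this is plain Python for illustration, not the actual tool; the file names and the threshold are illustrative):

```python
# Mark (rather than drop) variants with total depth below a threshold.
MIN_DP = 10

with open("variants.vcf") as vcf, open("variants.filtered.vcf", "w") as out:
    for line in vcf:
        if line.startswith("#"):           # keep the header untouched
            out.write(line)
            continue
        cols = line.rstrip("\n").split("\t")
        info = dict(kv.split("=", 1) for kv in cols[7].split(";") if "=" in kv)
        depth = int(info.get("DP", 0))
        cols[6] = "PASS" if depth >= MIN_DP else "LowDP"   # FILTER column
        out.write("\t".join(cols) + "\n")
```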
When you have done this variant cleaning, what is good to do is to annotate your data. One of the annotations we use is mappability: we know that not all regions of the genome are mappable at the same rate, some are harder to map than others, especially because of repeated regions or GC content, and that can have an impact. If you know you are in a region that looks hyper-mutated, or where you expect more reads to pile up because all the copies of a repeated region map there, you expect to see more variants, so it's a good annotation to have. Also annotate your variants with known variants, to know whether a call is already a common variant or not, find the impact of your variant (the variant effect, or functional annotation), and, if you are working in cancer, annotate with the COSMIC database, which is the database of variants implicated in cancer.

Today we will annotate the variants with dbSNP, and we will annotate the impact of the variants using SnpEff. There are other tools that can do this, but we'll use SnpEff. It takes the reference genome and a set of transcripts, looks at where the variant falls in the transcript and what the variant does: which amino acid change it could cause and which function it could affect. It will tell you whether your variant is coding or non-coding and what the impact is: high, moderate, low or modifier. It's a really good way to filter your data if you are looking for a specific disease, or a specific variant implicated in a disease or in cancer. Usually we look first at the high-impact variants and then the moderate ones, because low and modifier variants usually have essentially no impact on your protein; it's really unlikely that that kind of variant will affect the protein.

As I told you, the processing takes approximately 10 hours to generate the filtered and annotated variants; GATK and SnpEff are really efficient. But what is really important, and what you cannot put a fixed time or cost on, is looking at the variants and understanding the biology behind them.

A small add-on on VCF visualization: you can load your VCF in IGV, and we'll do it today. You have your reads, and you can load a VCF on top, which gives you a general visualization of the variants: the first bar is the proportion of samples that have the variant, and then usually you have the detail for each sample, so you can go and look at your actual data.

To finish, two or three slides of metrics. This is the classical metrics slide: when you do variant calling, you still need to look at your stats to be able to choose the correct filters to apply to your variants. You can also look at some more specific stats: if you run SnpEff, it provides a really good set of statistics that you can use to see where your variants are located, what type of variants they are, what type of transitions, and so on. This is the kind of metric you can use to check whether what you see is close to what you expect. So that's it, now we can do the variant calling. Any questions? Yes: could you... I'm not sure I understand, in this example does SnpEff just find what the impact of the variant on the protein is, what the change will do to the protein?
Yes, SnpEff will predict the impact: whether the variant will change the product, the protein. I see, yeah. And if it's in an intron or an intergenic region? For an intron it will try to assess whether it's a splice variant; SnpEff will do that. If not, if it's intergenic, it will flag it as intergenic or intronic and classify it as low or modifier impact. There are not really many tools for this kind of variant, but what you can do is look at, for example, DNase or ChIP-seq data to see whether a possible transcription factor could overlap the position of your variant. There's no tool dedicated to that, and that's the main problem with that kind of variant.
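Coming back to the high/moderate filtering mentioned above, here is a minimal sketch of how one might keep only the variants whose SnpEff annotation carries a HIGH or MODERATE impact term; it simply scans the INFO column for those keywords rather than parsing the full SnpEff annotation sub-fields, and the file names are illustrative:

```python
# Keep header lines plus variants annotated as HIGH or MODERATE impact.
KEEP = ("HIGH", "MODERATE")

with open("variants.annotated.vcf") as vcf, open("variants.high_moderate.vcf", "w") as out:
    for line in vcf:
        if line.startswith("#") or any(k in line.split("\t")[7] for k in KEEP):
            out.write(line)
```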