Good morning. We'll talk about variant calling, as usual. My first lecture will be on small variant calling, mainly focused on SNVs and SNPs. What are the learning objectives of the class? To give you an overview of how we do variant calling and of what happens in the process; to understand the basic principles of that process; to know, once you have done your variant calling, how you can improve it, that is, how to filter and refine your variants; to be able to call variants and learn about the file format; and, at the end, a little insight into how to visualize the variants.

As I said yesterday, I will do a small shameless advertisement for Compute Canada, our partner. If you are a Canadian academic, you have access to all these CPUs and all these clusters for free. It's really interesting and it saves us a lot of money. We have a good partnership with them: in particular, we maintain the bioinformatics tools on Compute Canada, more than 90 tools. We maintain the genome resources, so 14 species available and 20 different builds. And we provide pipelines for analysis. The slide says six, but it's outdated; we are now at eight pipelines. Among the ones missing here are the analysis of tumour/normal pairs and the 16S metagenomics analysis. If you are interested, this is the repository for the pipelines. We're also part of the GenAP consortium. GenAP is a kind of hub that links all of this together: it gives access to pipelines, to private genome browsers, and to private Galaxy instances that run on Compute Canada servers. You can build your own project, put your data on Compute Canada, and everything runs there under your management. So it's really interesting if you are Canadian; if not, well, you need to come to Canada.

More seriously, what we will talk about today is variant calling. Why do people want variant calling? Because they want to know the genetic variation across individuals. People are interested in genetic variation when they study cancer, when they work in agriculture, and for a lot of other reasons, and you want to analyze your data to find variants. So what is the workflow? Three main steps: the data processing that we saw yesterday, the variant discovery, and the variant refinement. When you do the variant discovery, depending on how many samples you have at the time, you can do it per individual sample (single-sample calling) or jointly across all your samples. Then, as at each step, once you have generated your variants you usually filter them to improve your data. After that we usually do a functional annotation, we compare our variants to control databases, and then we do a final variant evaluation: first in IGV, to see whether the variant looks real, and if it looks real we usually validate the variant through Sanger sequencing. Not always Sanger, though, because Sanger sequencing does not work well when you work on cancer: if only a low fraction of the DNA contains your variant, because of clonality or cellularity, Sanger sequencing is not the best method for validation, and in that case we go with a high-coverage targeted sequencing method instead. So, here is the summary of what we did yesterday.
Today we'll talk about small variant calling, so SNPs and indels and how we can filter them; later this afternoon we'll cover structural variants and copy number variation. As I explained yesterday, when you have your data it's really important to ask: what is the quality of my data? It's really important to look at your data and to be sure that your data are good, so you can be confident in them. Otherwise you will always have a kind of doubt in your mind: is what I'm seeing true or not?

Some tips for your quality control, which we already saw yesterday. What is really, really important for variant calling is to be sure that all the samples have been processed the same way: same library, same protocol, same type of instrument. Each technology can bring its own artifacts, and if 10% of your samples were sequenced with another library prep or another sequencer, or processed with another pipeline, you could see 10% of variants appear in your cohort that look fine but are really just your technology. This matters less at the individual level, but it is really important when you want to compare samples and contrast their variants.

When we do SNV calling, what we call SNV calling usually includes looking at indels too, but indels are quite complicated and there is still no really good method for them, so most people focus on SNVs. A SNP is a position of the genome where the reference genome has a given base and your sample genome has another, variant base. What is the goal of SNP discovery? It is to tell apart that kind of position, where we have a SNP, from positions where we have a sequencing error. As you can see, the main parameters that let you make that distinction are: good base quality, to lower the number of sequencing errors you consider; depth of coverage, because if I have only 3x of coverage, like the last three lines on the chart here, I cannot differentiate which one is a real variant; and accurate mapping. Yesterday I spent a lot of time showing you the main parts of alignment and refinement, and I will come back to that today too.

Base quality we talked about yesterday, so I don't need to give you more details. Just to give you an idea, here are two cases, one where the data quality is high and one where it is low. In the first one I would be confident to call a variant there, if I have enough coverage. In the second one I know that the base quality is bad here, and that the base quality is bad in the other reads too, so I'm not sure about what I'm seeing. So it is really important to have high quality and to remove the low-quality bases.

Just to give you an idea of the workflow we use to do the SNP calling: first the processing of the data. The sequencer provides the data as images of each cycle; you do the base calling, then the read mapping; then, really important, you refine your mapping; and then you have two possibilities: single-sample calling, each sample on its own, or multi-sample calling.
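To make that base-quality-and-depth argument concrete, here is a minimal sketch (mine, not from the lecture; the function names and numbers are illustrative) of asking how likely it is that the mismatches at a position are sequencing errors alone:

```python
# How likely is it that the non-reference bases at a position are all
# sequencing errors? Both the base quality and the depth enter the answer.
from math import comb

def phred_to_error(q: int) -> float:
    """Convert a Phred quality score into a probability of sequencing error."""
    return 10 ** (-q / 10)

def p_errors(depth: int, alt_count: int, qual: int) -> float:
    """Binomial probability of seeing >= alt_count errors among depth reads."""
    e = phred_to_error(qual)
    return sum(comb(depth, k) * e**k * (1 - e)**(depth - k)
               for k in range(alt_count, depth + 1))

print(p_errors(depth=3,  alt_count=1,  qual=10))  # ~0.27: plausibly an error
print(p_errors(depth=3,  alt_count=1,  qual=30))  # ~0.003: error is unlikely
print(p_errors(depth=30, alt_count=10, qual=30))  # ~3e-23: a real variant
```

Even with high base quality, 3x stays ambiguous at the genotype level: one variant read out of three fits both a noisy homozygote and a heterozygote, which is why depth matters beyond the error question.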
In single-sample calling, you find the positions where you have a SNP and you genotype the SNP at the same time. This type of approach uses Bayesian methods. If you have a really high depth of coverage, more than 100x, some methods switch from a Bayesian to a threshold approach because it is much faster. The threshold approach just means you choose given thresholds, say 20% and 80%: if your variant read fraction is between 20% and 80%, you call a heterozygote; above 80%, you call homozygous variant; below 20%, you call homozygous reference. Other methods use a t-test approach, but that one is more for comparing normal and tumour: the t-test asks whether what I see in my tumour, in terms of variant read counts, is different from what I see in my normal sample.

That is one approach, but if you have multiple samples, which is quite often the case, it is really recommended to do the calling through multi-sample calling. In that case it is a two-step method. The first step is a Bayesian approach where you take the information from every sample, and all you do in this step is find the positions where there could be a SNP in any sample. For each position this step just provides posterior probabilities, and from these posterior probabilities you then do a maximum likelihood estimate in each sample to do the genotyping. The main point, and the strength of this approach, is that you determine the possible position of each SNP using the information from everybody, so you have a lot of information to find them, and then you do the genotyping per sample.

SNP calling, then. I told you that to do this we use a Bayesian approach. As a rule of thumb I try not to put in a lot of equations, so I will not go into details; you have them if you want. But just to give you an idea: the SNP calling problem is to find the probability that you have a given genotype knowing that you observed a specific set of data, P(G|D) ∝ P(D|G) P(G). I'm not going into the details, but it is mainly based on the probability of your data given each possible genotype, and that probability is driven by which bases you observe and their base qualities. So we always come back to base quality as a major parameter of your data. We won't go into detail on this formula, because every caller has its own formula, its own Bayesian approach, so I cannot give you the details of all of them. But base quality is really important, and, as I said, so is your alignment.

The strategies to improve your variant calling, as we saw yesterday, are the different processing steps: local realignment, marking duplicates, base quality recalibration, and also a new one that we haven't seen yet, imputation. Local realignment, really quickly: we saw yesterday that when you have this kind of pattern, a mix of indels and SNPs in close proximity, you can realign your reads to check that the SNPs are not false positives, because aligners tend to favour mismatches instead of opening an indel.
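To make the genotype-likelihood idea concrete, here is a minimal sketch of the textbook diploid model (every caller has its own variant of this, so take it as an illustration; the names are mine). The likelihood of a genotype is the product, over reads, of the chance of observing each base given the two alleles, with the per-base error rate taken from the Phred base quality:

```python
from math import prod

def p_base_given_allele(base: str, allele: str, qual: int) -> float:
    """Chance of observing `base` from a chromosome carrying `allele`."""
    e = 10 ** (-qual / 10)             # Phred quality -> error probability
    return 1 - e if base == allele else e / 3

def genotype_likelihood(pileup, a1: str, a2: str) -> float:
    """P(D | G) for a diploid genotype (a1, a2); pileup is [(base, qual), ...]."""
    # Each read comes from one of the two chromosomes with probability 1/2.
    return prod(0.5 * p_base_given_allele(b, a1, q) +
                0.5 * p_base_given_allele(b, a2, q)
                for b, q in pileup)

# Six reads at one position: three reference 'A', three variant 'G', all Q30.
pileup = [("A", 30)] * 3 + [("G", 30)] * 3
for gt in [("A", "A"), ("A", "G"), ("G", "G")]:
    print(gt, genotype_likelihood(pileup, *gt))
# The heterozygote A/G dominates; multiplying by a genotype prior P(G) and
# normalising gives the posterior P(G | D) that the caller reports.
```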
Just so you know, because of processing-time concerns, some callers now do this local realignment step during the variant calling itself. So if you use HaplotypeCaller from GATK, the one we will use today, it is not mandatory to do the local realignment beforehand, because it will redo it: it realigns around each variant position.

Second is marking duplicates. We saw yesterday how to do it; here it is more to give you an idea of the impact of duplicates when you do the variant calling. Imagine this read here has duplicates, and unfortunately, during the first cycles of the run, the library fragment picked up a sequencing error. If you don't flag these reads as duplicates, you will see that at this position almost 60% of the reads show the variant; if I mark the duplicates, as on the left, you clearly go from a position where you would probably make a strong variant call to one where you make either no call or a call with low variant quality (there is a small sketch of this effect after this section).

Base quality recalibration we talked about yesterday, and I showed you that the base qualities feed directly into the variant caller when variants are called. So it is really worthwhile to have the original base qualities recalibrated based on their context, to avoid errors.

Now, the other ways to improve variant calling that I mentioned yesterday: family or population structure, and imputation. First, imputation. The way it works: we now have many consortia that have sequenced and genotyped thousands of genomes. If you are working with samples that come from the same population, the idea is, why not use the information from those other sequenced samples to improve your variant calls? For example, you know from the population that in a specific region you have these two haplotypes, and in your data the quality is not so good, because you used other methods, so you see two variants but you are not able to decide the value of the possible SNP at this position. With imputation you can go back to the population and say: there is a high probability that the variant at this position is this one. The principle of imputation is not as easy as the example I have given makes it look; it is almost never that clean. You put into some kind of weighted model the distances between the known SNPs and you impute the internal SNP based on recombination, so you have to put into your model the distances and an estimated distribution of recombination between the different positions. Because some samples will probably be assigned differently depending on the location, if you only match, say, 80% of a known haplotype or you carry another combination, there is a kind of probabilistic choice for each sample. So it is not 100% sure: in cases like the two clean haplotypes it is essentially certain, but sometimes imputation can be wrong.

As I said, this kind of data is important for doing the multi-sample calling. This is an old slide, from 2011, so not all of these callers are still available, but it was interesting to see, at the time people started to think about populations, a comparison of single-sample calling, multi-sample calling, and multi-sample calling with imputation: accuracy is quite a bit better when you use the population information.
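Going back to the duplicates example, here is a minimal sketch (the Read type and the single-position pileup are my simplification) of how PCR duplicates inflate the apparent variant allele fraction, and how keeping one read per mapping key corrects it:

```python
from collections import namedtuple

# One read, reduced to the fields a duplicate marker keys on, plus the base
# it shows at the position of interest.
Read = namedtuple("Read", "chrom start strand base_at_site")

reads = [
    Read("chr1", 1000, "+", "T"),  # fragment with an early-cycle error
    Read("chr1", 1000, "+", "T"),  # PCR duplicate of the same fragment
    Read("chr1", 1000, "+", "T"),  # PCR duplicate of the same fragment
    Read("chr1", 1005, "+", "C"),  # independent fragments, reference base
    Read("chr1", 1010, "-", "C"),
]

def vaf(rs, alt="T"):
    """Fraction of reads supporting the alternate base."""
    return sum(r.base_at_site == alt for r in rs) / len(rs)

# Duplicates share the same mapping coordinates and strand, so keep one
# representative per (chrom, start, strand) key, as duplicate markers do.
unique = list({(r.chrom, r.start, r.strand): r for r in reads}.values())

print(f"with duplicates: {vaf(reads):.0%}")   # 60%: looks like a strong call
print(f"after marking:   {vaf(unique):.0%}")  # 33%: one low-confidence read
```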
Another really good way to improve your variants is if you have family structure. If you have a trio, that is really good information, because usually when you do a trio, if it is well designed, you are most interested in the child's genotype, and you know that what you see in the child comes half from one parent and half from the other. So for almost every variant in the child, the information is replicated in the parents, and you have more evidence to confirm your call. And you can use that when you make a call: if I call a homozygous variant here in the child, but one parent is homozygous variant and the other is homozygous reference, it is not possible, except if there is a de novo mutation. And the de novo mutation rate is something people are interested in, so you can also extract that information from trios. So it is really interesting to work with trios.

What does it take? A BAM file for whole genome sequencing is usually around 200 gigabytes, and you go from that BAM file to a raw variant file, a VCF, which is around one gigabyte for a whole genome analysis, and way smaller if you do a capture, for example a whole exome. The processing takes around 10 hours for a whole genome.

Now, what do we do with the variants once we have called them? First, the format in which you will receive them: VCF. The VCF format is really the standard for SNVs. Some people push to use this standard for structural variants too, and some people think VCF is not a good format for structural variants; it is a debate in the field, but for SNVs it is a really good format. It is composed of two parts. The header has two mandatory lines: one telling you which version of VCF you are using and one describing the columns. You also get a whole bunch of information lines; nowadays almost every caller writes them. These lines describe what information will be given in the INFO column and what will be given, per genotype, in the FORMAT column.

The columns are: chromosome; position; ID of the SNP if you have it (if you don't, or you didn't look it up, you will have a dot); reference allele; alternate allele; quality of the call; the filter field (if you have filtered your data and the variant passes the filters, it says PASS); the INFO field; and the FORMAT. DP, for example, which you can see here in INFO, is the total depth at the position. These fields are specific to each caller, and most of the time they are described in the header. In the FORMAT column, as I said, the fields are separated by colons; most of the time you will have the GT and DP fields, but beyond that each caller adds whatever parameters it wants. GT is the genotype and DP is the depth. Here you can see the genotype is given as 0/1, which tells you the sample is heterozygous at this position.
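Here is a minimal sketch of reading the fields just described, plus the trio idea from above. The sample order and the simple containment check are my assumptions (a strict Mendelian check would require one allele from each parent), so treat it as an illustration only:

```python
# One VCF data line: fixed columns, then FORMAT, then one column per sample.
line = "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=87\tGT:DP\t0/1:30\t0/1:29\t1/1:28"

fields = line.split("\t")
chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]
fmt_keys = fields[8].split(":")                  # ["GT", "DP"]
samples = [dict(zip(fmt_keys, s.split(":"))) for s in fields[9:]]

for s in samples:
    print(s["GT"], s["DP"])                      # genotype and depth per sample

# Loose trio consistency (assumed order: father, mother, child): every child
# allele should be seen in at least one parent, otherwise the call is either
# an error or a candidate de novo mutation.
father, mother, child = (set(s["GT"].replace("|", "/").split("/"))
                         for s in samples)
print("consistent:", child <= father | mother)   # True for this line
```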
Now, variant filtering. Whatever caller you choose, you will have a lot of false positives, and the idea is to filter them out. There are two ways to do that, and you can do both, but one is more complicated than the other and asks for more background knowledge. The first is manual filtering based on different parameters, using GATK: you know your data and you know how the variants were called, so you decide that, based on a given quality score and a given level of coverage, you apply parameters that describe what you expect a good variant to look like (a small sketch of this kind of hard filtering appears after this section). It is difficult because you need to know exactly what you are doing and exactly which parameters are important, so you need expertise, but it is really efficient and fast.

Now people do more what we call variant recalibration. The idea is, instead of doing it by hand, to take information from population variants to select a set of variants, learn from them the parameter profile of good variants, and apply what was learned automatically to all the variants, removing some of them. Variant recalibration can only be done when you have a lot of samples, because it is a machine learning process: you need at least 5 to 10 samples for it to work. You can still run the command, but if you don't have enough variants it won't behave well. Today, as we work on a small region with a small sample, we won't do it.

You can also use databases to annotate and assess your variants: the idea is to use known variant sets to find the possible false negatives and false positives. HapMap is high quality, so you don't have that many variants, but you know these variants are extremely well validated; dbSNP, on the other hand, is a bit of everything, so it is more permissive.

Variant recalibration in more detail, then. You take your variant set, and you take a set of validated variants, or variants you know are highly validated, which you treat as the truth. You select the subset of variants in your data that are found in that high-quality set: this defines your training set. From this call set you learn the rules, the values of the different parameters that recognize the good variants and the bad variants. Then you apply the rules to all the sites: you give a score to every variant, whether or not it is in your training set. Then you tell the tool which sensitivity you want on your call set, say 99% sensitivity, and it determines which score threshold gives 99% sensitivity on the known variants. That defines your cutoff, and you apply it to the rest of the variants.

When that is done, you usually annotate your variants. One important thing to annotate is mappability: is my variant in a region where I know it is hard to map reads? There are genome tracks that measure this parameter, and if you are in a region with known low mappability, there could be an issue with your variant and you cannot be too confident in it. You can use dbSNP to see whether the variant has already been reported. You can predict the effect of your variant with a tool like SnpEff or another: the idea is that when a variant overlaps a transcript, you report where the variant is located in the transcript and what impact it has on the protein. Is there an amino acid change, is there a frameshift in your protein? dbNSFP is another database you can use to annotate the change: it is a database that reports, for nonsynonymous variants, predictions of the effect of the change on the function of your protein.
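As mentioned above, here is a minimal sketch of the manual hard-filtering route. The cutoffs follow GATK's commonly cited generic recommendations for SNPs (QD < 2.0, FS > 60.0, MQ < 40.0); your own data may need different values, which is exactly why this route demands expertise:

```python
def parse_info(info: str) -> dict:
    """Turn a VCF INFO string like 'QD=14.2;FS=3.1' into a dict of floats."""
    return {k: float(v) for k, v in
            (kv.split("=") for kv in info.split(";") if "=" in kv)}

def hard_filter(info: str) -> str:
    """Return 'PASS' or a semicolon-separated list of failed filters."""
    a = parse_info(info)
    fails = []
    if a.get("QD", 99.0) < 2.0:  fails.append("LowQD")       # quality by depth
    if a.get("FS", 0.0) > 60.0:  fails.append("StrandBias")  # Fisher strand test
    if a.get("MQ", 99.0) < 40.0: fails.append("LowMQ")       # mapping quality
    return ";".join(fails) or "PASS"

print(hard_filter("QD=14.2;FS=3.1;MQ=59.8"))  # PASS
print(hard_filter("QD=1.1;FS=75.0;MQ=58.0"))  # LowQD;StrandBias
```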
If you work on cancer, you can also use COSMIC, which reports the somatic mutations. This is not an exhaustive list; as we said yesterday, there are a lot of other databases you can use, but it gives you an idea of what you can do to annotate your variants. Today we will use SnpEff to do the annotation. How does it work? It uses the reference genome together with the Ensembl transcripts, and it calculates the effect. First it checks whether your variant is in an intergenic region; if it is intergenic, it checks whether it hits a transcription factor binding site or some other element we have information about. If it is not intergenic, we are in a gene region, and in that case it looks at what effect your variant creates on the transcript. Based on that, it produces an annotation for each transcript that overlaps your variant, so if you have 5 transcripts, you will have 5 annotations for this variant, displayed from the highest to the lowest impact, and it tells you whether the impact is HIGH, MODERATE, LOW, or just MODIFIER. Usually, when we do an analysis at the centre and we have an annotated list of variants, we split it into a HIGH and MODERATE impact group and a LOW and MODIFIER impact group; we start working with HIGH and MODERATE, and if we don't find anything interesting we start to dig into LOW and MODIFIER (a small sketch of this triage follows at the end of this section). You can do many other annotations as well; SnpEff is a really good tool.

Just so you know: it takes 10 hours to generate the variants and less than one hour to do the filtering and the annotation, but making sense of it can take days or weeks, because it depends on how you interpret the data.

To finish, some add-ons. We saw IGV yesterday: you can open your VCF file in IGV, and what you get is a summary of the calls. Here it displays the proportion of samples showing the variant at the position, plus the report by sample, and you can also load the corresponding BAM underneath to display your alignments as usual.

When we do variant calling, as at every step, it is important to collect metrics: the metrics of the sequencing, but also the metrics of the variants. SnpEff provides this kind of statistics. It is interesting; for example, here it displays which type of region contains the variants. Most of the variants are in introns, but you see some in intergenic regions and so on, so you get a kind of visual feel for the data. You also get the frequency of each base change (from one base to another), the transition/transversion ratio, and a breakdown by type of variant.
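The HIGH/MODERATE versus LOW/MODIFIER triage mentioned above can be scripted from SnpEff's ANN INFO field, where each comma-separated annotation is pipe-delimited and the third subfield is the putative impact. A minimal sketch (the variants and gene names are made up):

```python
def best_impact(ann_value: str) -> str:
    """Highest putative impact among all of a variant's SnpEff annotations."""
    order = ["HIGH", "MODERATE", "LOW", "MODIFIER"]
    impacts = {a.split("|")[2] for a in ann_value.split(",")}
    return next(i for i in order if i in impacts)

variants = {  # variant -> ANN value (both hypothetical)
    "chr1:12345 A>G": "G|missense_variant|MODERATE|GENE1|...",
    "chr2:555 C>T":   "T|intron_variant|MODIFIER|GENE2|...",
}

first_pass = [v for v, ann in variants.items()
              if best_impact(ann) in ("HIGH", "MODERATE")]
print(first_pass)  # work on these first; dig into the rest if nothing pans out
```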
I think that's it.

[Audience] Is it usual to get so many variants in intron regions compared to exonic regions?

Yeah.

[Audience] Why is that?

It's because in introns you don't have such selection pressure. When you have a variant in an exon, there will be selective pressure to keep it or not, depending on whether it provides an advantage or a disadvantage, and I guess the same goes for intergenic regions. In the intron you don't have a functional impact, except for things like splicing modifiers at the edges of the intron; in the middle of the intron it matters little, because the intron is excised at the transcript level. So you have really much less pressure on those variants.