Good morning, everyone. OK, so let's continue to explore what we can do with DNA-seq analysis. Now that we have seen, in the previous module, how we can generate a good quality BAM, let's move forward and do some calling. So the first module will be on small variant calling, and later on today we'll talk about structural variant calling.

What I hope you will learn today is to have an overview of the pipeline that we use to do SNV calling, so single nucleotide variant calling. This pipeline is mainly inspired by what has been proposed in the GATK Best Practices. I hope you will understand the basic principles of variant calling. Why do I say the basic principles? Because each variant caller uses different methods, different algorithms, different probability formulas, so I cannot give you much detail on each of them, but you need to understand how they work. You should also understand how we can get the best variant calls — we already talked about that yesterday: when you try to improve the BAM, how that has an impact on the variant calling. So we'll revisit that part, plus other methodologies we can use to improve the variant calling. We'll see how to filter and annotate the variants. And we will also learn about the VCF format and a bit of visualization, but you already saw that yesterday with Hamza.

So before we start, just a kind of shameless plug before the real meat. For all the people that are Canadian, I would like to introduce you to a non-profit organization called Compute Canada, which is funded by the federal government. The goal of this organization is to provide high-performance computing resources to researchers in Canada for free. So everything that we do, you can do. We use it a lot; we have our own cluster internally, but we use it a lot. It's a really, really good opportunity for all Canadian researchers. You can get a free account just by going there and registering. If you want more allocation, more resources, then every year you need to write a small proposal, and most of the time you will get what you ask for. There are HPC centres all over Canada, so it's cool — you can talk with your local HPC centre. And why it's cool is that we developed a partnership with them, and we are in charge of maintaining all the bioinformatics tools on their systems, so all the tools are installed in the same fashion on every cluster. We also provide genomic resources to the bioinformatics and genomics community, and we provide a set of well standardized, high quality pipelines that people can use directly. So everything you see today, you can reproduce on Compute Canada using the pipelines. If you want more detail on the pipelines, this is where the code is hosted, and you can also send us an email. All this initiative is part of the GenAP consortium that we are part of, whose goal is to standardize genomics resources for the community and also to offer, for people that don't want to code, another type of analysis facility with Galaxy. We don't develop Galaxy, but we have installed a Canadian Galaxy version which runs directly on Compute Canada infrastructure, so it's more efficient, and it also allows you to create a kind of data hub to share your data with your collaborators.

So now I'll finish my little plug. Let's talk about small variant calling. OK, so why are people doing small variant calling?
It comes from all the genome sequencing efforts, where people run these large, massive projects trying to sequence a lot of people. What they want in that case is to understand the individual variation of each person, but also the shared variation between people. That is why finding variation has become a really major task in the field. It's also really important when we do cancer genomics. For example, we are part of different consortia, one of which is called Profile, where we have kids and young adults that are resistant to traditional cancer treatment. When we receive a sample, we have two weeks to find their variation, their somatic variation, and provide it to clinicians so they can adjust the therapy or try a new therapy. So it's really important to have good quality variant calling and to be confident in what we provide. It's also used a lot in agricultural crops and many other fields.

So, as we described yesterday, this is our main workflow for DNA-seq variant calling. Yesterday we saw the first part, the data processing, and today we will see this part, which looks longer than the first one, but it's not longer — it's more complicated. In terms of steps there are far fewer steps to do, but more reflection to have. So, a summary of how we work: as I said, when you want to do this kind of analysis, the first thing to do is quality control — look at your data. If my data is good, then you do processing and mapping, and when you have a really good mapping, you do some small variant calling and some structural variant calling. Why is quality control important? Because if you want to compare samples, you need good quality for every sample; otherwise, the one with lower quality will show technical, not biological, variants. So it's really important that all of your samples are good quality, but also that all of your samples come from the same protocol and the same instrument. It's really, really important to think about that. We saw some people that generated data, then looked in databases for variants and said, oh, I don't find my variant. But that database, for example, didn't use the same exome kit as you, so it doesn't cover the same regions, so it's normal that some variants are missing. But you need to know; you need to understand all these concepts before trying to compare what you see in one sample to another. So, as I said, it is really, really important if you want to compare different samples or different conditions. This applies to DNA-seq, but it will also apply to RNA-seq and all your other sequencing technologies or projects.

So, as I said, we will look at small variant calling, and we'll first look at how the calling itself is done. When we do single nucleotide variant calling, what do we want to do? We want to find SNPs: positions in the genome where you have your reference sequence, and what we observe in the sample at the same position is different from what we have in the reference. So it looks simple like that, yes — it's obvious when you look at this graph. Here is another way to show it, where it's still fairly obvious to say, OK, I've got a SNP here, and you can see we have evidence that I've got a SNP here.
And here the goal is to make the difference between this position, which is a real SNP, and these other positions, which are sequencing errors. So what we can see from that first slide is that one of the most important factors is coverage. If I've got enough coverage, I should be able to find my SNP. Why? Because if I have a real SNP, I will have an accumulation of evidence at the same position, whereas sequencing errors are random, so you don't expect sequencing errors to all appear at the same location. For sure, there are other types of errors and artifacts that can generate non-random errors, non-random false variants, but in general the concept is there. So why is coverage important? Imagine here we are at around 10x. Now imagine I only have three reads, keeping only the last sequences: making the difference between my sequencing errors and my SNP becomes really, really complicated.

Another important parameter to take into account when we do SNP calling is base quality. I will not explain again what base quality is — we saw some of it yesterday — but it's really important. When you start to look at a read: if I see, OK, this is my read, this is my reference, and I've got a variation here with really high base quality, it could still be a sequencing error or some other kind of error, for sure, but I will have more confidence to say, oh, this is something that is really in my read, than here, where I have a lower base quality. Is it really a sequencing error? Is it a real variation? It's trickier to decide. So it's important, and we'll see later that most variant callers take the base quality into their formula when doing the variant calling.

So when we do SNP and genotype calling, in summary: we have the machine, we do the base calling, we generate FASTQs, and we map our reads. Then we improve our alignment. And then there are two ways of doing the variant calling. You can do single-sample calling, where you call the variants of each sample individually. This is done when you have only one sample, or when there is something specific — for example for tumours, you usually do single-sample calling or paired-sample calling, which means one tumour sample with one normal, blood, sample. In those cases, you try to identify the SNP and the genotype at the same time, and you use a Bayesian or a threshold approach. Threshold is when you have really, really high coverage: you can just count, and if between 25% and 75% of your reads support the variant, you say, OK, I've got a heterozygote. But if you don't have super high coverage, most methods will use a Bayesian approach, or a statistical test approach in the case of some somatic callers. But this is not the best way to do it. If you have enough samples, you should go with multi-sample calling. So what is multi-sample calling? The idea is to take all your samples together and to mix the information of all samples — not to determine the genotype of each sample, but to find all the possible regions of your genome where there could possibly be a SNP. Using the information from everybody, you will find all the positions where there is probably some variation. So the idea is really to increase your level of information. Then you get your targets, and then you use a maximum-likelihood estimation approach, where you go back to the information of each sample and do the genotyping.
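Whichever caller you choose, the core ingredients are the ones we just listed: the read counts at the site and the quality of the bases. Here is a small, purely illustrative Python sketch — not the actual model of GATK or any other caller — of the simple threshold genotyper described above and a naive, base-quality-driven genotype likelihood (we will come back to the Bayesian idea in a moment):

```python
import math

def threshold_genotype(ref_count, alt_count, min_frac=0.25, max_frac=0.75):
    """Naive high-coverage genotyper: just count the reads supporting the variant."""
    depth = ref_count + alt_count
    if depth == 0:
        return "./."                     # no data, no call
    vaf = alt_count / depth
    if vaf < min_frac:
        return "0/0"                     # homozygous reference
    if vaf <= max_frac:
        return "0/1"                     # heterozygous: 25-75% of reads carry the variant
    return "1/1"                         # homozygous alternate

def genotype_likelihoods(bases, quals, ref="A", alt="T"):
    """Toy diploid genotype log10-likelihoods from a pileup of bases and Phred qualities."""
    log10_lik = {"0/0": 0.0, "0/1": 0.0, "1/1": 0.0}
    alt_frac = {"0/0": 0.0, "0/1": 0.5, "1/1": 1.0}      # expected fraction of ALT-carrying reads
    for base, q in zip(bases, quals):
        p_err = 10 ** (-q / 10.0)                        # probability this base call is wrong
        for gt, f in alt_frac.items():
            p_alt = f * (1 - p_err) + (1 - f) * p_err    # chance of observing the ALT base under gt
            p_obs = p_alt if base == alt else 1 - p_alt
            log10_lik[gt] += math.log10(max(p_obs, 1e-300))
    return log10_lik

# Eight reads cover the site: five support T (two of them low quality), three support A.
bases = ["T", "T", "A", "T", "A", "T", "A", "T"]
quals = [30, 35, 32, 10, 33, 38, 31, 12]
lik = genotype_likelihoods(bases, quals)
print(threshold_genotype(ref_count=3, alt_count=5), max(lik, key=lik.get))   # both say 0/1
```

Real callers do the same kind of bookkeeping, just with much better error models and priors.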
OK, at this position where I think there is possibly a variant, this sample will be reference/reference, this one will be heterozygous, this one will be homozygous alternate. It works way better than doing single-sample calling.

Isn't that using your data twice? Sorry? Isn't that using your data twice? No, because you don't do the same thing. In the first part, you really mix all the information to say, OK, I have enough evidence to say this region probably contains a variant in some samples — I don't know which ones; I just take all my reads and say, OK, I accumulate evidence of variant reads, and that's it. Then you go back and try to decipher, sample by sample, the signal you have, to do the genotyping. So you don't use exactly the same information twice, or at least not for the same purpose.

I will not go into detail, because I know that for a lot of people, when I start to explain some mathematical formulas, they start to fall asleep. Just so you know, when we do variant calling, the idea is to find the genotype of a sample given the data observed at the sequence level — basically, the likelihood of each genotype given the data. All the callers use a Bayesian approach with different flavours of this formula. What is important to understand is that this likelihood of a genotype is driven by how many bases actually cover the locus, and by the quality of the bases you are looking at. That is why keeping high base quality is important when you do variant calling. Nobody asleep? OK. So, as I said, it's not important to know each formula perfectly, because each caller uses a different formula to do the variant calling; once you have chosen one, you can go into more detail.

What is more important, I think, is to understand how we can improve the variant calling. These are the four main ways to do it. We already saw three of them yesterday, but we will go over all of them again today. The first thing is to do a local realignment. Why? Because first, it really improves how you call indels. That being said, there is no really good caller for indels — I won't say there is a really good caller for single nucleotide variants either, but indels are still really hard to call. This step improves how we call indels and also improves how we call SNPs by removing a lot of false positives. As I said yesterday, a lot of aligners tend to favour introducing mismatches instead of introducing gaps. So we quite often observe this kind of region, where we have an accumulation of variation depending on how the reads are mapped — possible variation — and usually, in the same region, we have some other reads that show an insertion. When the software sees this kind of pattern, it will try to move the different reads and see if that gives a better alignment of the region. And when you look at what we have before and after realignment, the two close SNPs that were here completely disappear when we introduce the indel into almost all the reads, which means that there was a real indel there, probably a homozygous indel. By realigning the reads around this indel, we removed two probably false-positive SNPs. So it can really help.

The other type of improvement we can use is to mark or remove duplicate reads.
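Conceptually, duplicate marking just means recognizing reads that come from the same original DNA fragment and only counting that fragment once. Here is a rough Python sketch of the idea — real tools like Picard MarkDuplicates also use read orientation, clipping and base-quality sums, so treat this as an illustration only:

```python
from collections import defaultdict

def mark_duplicates(reads):
    """Reads sharing the same fragment coordinates are treated as duplicates of one fragment."""
    by_fragment = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["mate_start"])   # same original fragment
        by_fragment[key].append(read)
    for group in by_fragment.values():
        group.sort(key=lambda r: r["mapq"], reverse=True)           # keep the best copy
        for dup in group[1:]:
            dup["duplicate"] = True                                  # mark, don't delete
    return reads

reads = [
    {"name": "r1", "chrom": "1", "start": 1000, "mate_start": 1220, "mapq": 60},
    {"name": "r2", "chrom": "1", "start": 1000, "mate_start": 1220, "mapq": 40},  # duplicate of r1
    {"name": "r3", "chrom": "1", "start": 1003, "mate_start": 1230, "mapq": 60},
]
print([r["name"] for r in mark_duplicates(reads) if r.get("duplicate")])          # -> ['r2']
```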
So just look at the top cartoon here — this is what you will face in your data. We say, OK, I'm looking at this region, I've got my reads, and oh, I've got an accumulation of reads that show a variation, so a lot of evidence for a variation at that position. Clearly, if I'm not looking at and taking into account any external information, I would say, OK, I probably have a variant there. But it would be a false-positive variant, because if you look — wow, this is exactly the same DNA fragment, the same original fragment. So if I only count it once instead of twice, then I only have one piece of evidence, and it's much more unlikely that this is a real variant. This is why it's important to mark or remove duplicates.

Base quality score recalibration: it's hard to show with a cartoon how it impacts the calls, but as we saw in the formula at the beginning, the Bayesian formula really relies on the base quality of each base. So having the best possible model of the base qualities in your data is really important.

Now, a new concept to improve the variant calling is using familial or population structure and imputation. What do I mean by that? Imagine that, from your population, or using large population projects like gnomAD or UK Biobank or something like that, you know the haplotypes in your population, and you've got these two haplotypes. Then you do your sequencing on your sample; you know that the sample comes from the same population, and you face this situation. So what would be the value for N? Be careful, there's a small trap, a colour trap — don't trust the colours. OK, so you would say, OK, there's a high probability that at this position I've got a T. This is what we do when we do imputation. When we have a region where, for example, we have a lack of coverage — not enough coverage, not enough information to call a real variant — then we can use the population to grab the information from another resource and introduce it into our data. This is an old slide, but just to give you an idea of the numbers: this is the accuracy of genotype calling when we do single-sample, multi-sample, and multi-sample plus population approaches. You can see that we get much higher accuracy with the multi-sample and population approaches.

Another way to improve your variant calls, which is usually done after you make your calls, is to use a family structure, for example trios. In that case, you have two parents and a child; you are usually interested in the genotype of the child, but you genotype the three samples. By using the structure, you take advantage of the duplication of data, as we do for multi-sample calling. But what is interesting in that structure is that you can also integrate additional information, like the Mendelian segregation of alleles, which means that almost every allele you find in the child should come from one or the other parent. There are also de novo mutations, and it allows you to detect them: if you see a really high quality variant that is not found in the parents, and you really trust and validate it, you are also able to estimate the de novo mutation rate. So there are a lot of additional features you can add to your data.
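As a toy illustration of the Mendelian check you get for free with a trio (a sketch, not how any particular trio-aware caller is actually implemented):

```python
def mendelian_consistent(child_gt, mother_gt, father_gt):
    """True if the child's genotype can be explained by one allele from each parent."""
    a, b = child_gt.replace("|", "/").split("/")
    mother = set(mother_gt.replace("|", "/").split("/"))
    father = set(father_gt.replace("|", "/").split("/"))
    # Try both ways of assigning the child's two alleles to the parents.
    return (a in mother and b in father) or (b in mother and a in father)

# A high-quality child variant absent from both parents is a candidate de novo mutation.
print(mendelian_consistent("0/1", "0/0", "0/1"))   # True  - can be inherited from the father
print(mendelian_consistent("0/1", "0/0", "0/0"))   # False - candidate de novo
```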
So just to give you an idea: when you do variant calling, we start from BAM files of around 200 gigabytes on average — it can be less, or much more — and for a whole genome we usually end up with a raw VCF of about one gigabyte. VCF means Variant Call Format.

And there are many, many tools to do that. Mostly we will use GATK, which is one of the most performant callers — the HaplotypeCaller. But other callers exist, like SAMtools, FreeBayes, Cortex. And one of my main messages here: I'm talking about traditional DNA-seq. If you are doing cancer, forget about these; it's a completely different type of variant caller, because in cancer there are many other parameters to take into account, like purity and cellularity, so these callers are not designed to work with cancer. Here, these are really the callers dedicated to traditional whole genome or whole exome variant calling.

So what is a VCF file? VCF, as I said, means Variant Call Format. In your VCF — and yesterday I didn't talk about this, but it's similar — there will always be a header in the file. You will have a set of lines with different information before you get to the real variant calls and genotype information. Yesterday, when I presented the BAM format, I didn't tell you, but it's the same: every BAM has a header, which is not the same as the VCF one. What is interesting about that header is that it contains a lot of information to understand your file, and all the processing steps used to generate the file should also be in the header. If you use standard tools, they will always record what they have done in the header. So if you receive a VCF, or if you receive a BAM, by looking at the header you can tell how the file was created. Here, when you have the double pound sign (##), it's there to describe the format: you've got which reference was used, and you've got INFO lines that describe the corresponding fields. So all this info here — NS means number of samples with data, DP means total depth — you have all this information that you can trace back from the header.

When you look at the data, for each variant you will have the position — which chromosome, which position — and an ID. Not all VCFs will have the ID filled in, because when you do the calling, you don't yet know whether the variant exists in a database or not, so you have to fill this ID field later. Then: what is seen as the reference, what is seen as the alternate allele, the quality of the call, and, if some filtering has been applied to the data, whether your variant passes that filter or not. Some variant callers will already do some pre-filtering for you, so this field could be filled; others will not — they just call, and you have to do the filtering on your own, in which case it will just be a dot. Then you've got the INFO field, which gives general information about the variant call across all samples. Then you have the FORMAT field. FORMAT is just there to describe what will be in the next columns for the individual genotyping: it tells you that you will have a GT, which is the genotype and is always there, and others like DP, which is the depth in each sample. You can see the fields are separated here, and you see the same separation in the sample columns, with the different pieces of information.

So what is a GT? GT is a genotype. You've got numbers: a genotype looks like this, 0 slash 1. In general it is a number, then a slash or a pipe, then another number — 0 slash or pipe 0, 0 slash or pipe 1, 1 slash or pipe 1. These are the three main genotypes you will see. There are others; you can see some others here.
So, in the middle, slash or pipe depends on whether your data has been phased. If you haven't phased your data, you will only have slashes; when your data has been phased, it will be a pipe. It's just a way to mark your data: one haplotype and the other. In most cases at the beginning you will only have the slash, because you haven't phased your data. The numbers refer to the alleles: the reference is allele 0, the alternate is allele 1. So 0/0 will be homozygous reference, 0/1 will be heterozygous, and 1/1 will be homozygous for the variant. Now, you can see in these two lines that the alternate can be more than one allele, and in that case, this is what we observe here: a genotype written 1/2. When you have a comma in the alternate column — separating the different alternate alleles — the alleles are numbered allele 0, allele 1, allele 2, and so on if there are more than one. So here, when we see 1/2, it's a GT genotype. Yes? What you don't know is the combination, or what the numbers 0, 1, 2 mean? The combination. So when you have a genotype, you have two alleles, and the alleles are noted like that: the reference allele is allele 0, the first alternate allele is allele 1, and the second alternate allele is allele 2. So 0/1 is heterozygous for the first alternate allele, 0/0 is homozygous reference, 1/1 is homozygous for the variant, and in our case, 1/2 is heterozygous, but with two alternate alleles.

That is the name of the sample. Each one corresponds to the genotype calls that have been made on each sample in the file, and each field is described by the FORMAT field.

I have a question about the multiple samples that you talked about. What do you use to decide which samples you combine together to call at the same time? Would they all come from the same library, or could you just take any? Yeah — no, you need to call the samples from your own experiment all together, with a harmonized library, sequencing, and so on. If you start to include samples with, for example, a different kit or a different sequencer, it won't work. It should be as harmonized as possible. So you do that first, and afterwards, if you want to include information from elsewhere, you start to do some imputation or comparison.

Yeah — so when you phase, you usually use information from the population and also from the BAM file, depending on the size of your reads. For example, if you use 10x Genomics to make artificial long reads, or if you use PacBio or Nanopore, which have long reads, you can use that information. With a short-read approach, you could do some phasing based on the reads, but your phase blocks will be really small.

No, because all the genomes we are looking at, all the references, are haploid. You always align your data on a haploid reference, so you don't need a special aligner. If you wanted to use a diploid reference, then you would need a specific aligner, or some trick around the aligner, to say there are two copies of the same sequence. Just for the alignment part: all the reference genomes, by default, are haploid versions of the genome. For the variant calling, it's different: you should inform your variant caller if you are in a haploid situation, so there will only be homozygotes, no heterozygotes. So for the variant calling, yes, you either change your variant caller or you tell it the ploidy you have in your experiment — many callers have this parameter. By default it is two, but you can set it to four, one, eight, depending on your data. But for the alignment, no.
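To make the format concrete, here is a tiny sketch of pulling the genotype and per-sample depth out of one VCF data line (illustration only — in practice you would normally use a dedicated library such as pysam or cyvcf2 rather than splitting strings by hand):

```python
def parse_vcf_record(line):
    """Return the alleles and per-sample genotype information from one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _, ref, alt = fields[0], int(fields[1]), fields[2], fields[3], fields[4]
    alleles = [ref] + alt.split(",")              # allele 0 = REF, 1 = first ALT, 2 = second ALT...
    fmt_keys = fields[8].split(":")               # the FORMAT column describes the sample columns
    samples = []
    for sample in fields[9:]:
        values = dict(zip(fmt_keys, sample.split(":")))
        gt = values.get("GT", "./.")
        sep = "|" if "|" in gt else "/"           # pipe = phased, slash = unphased
        called = [alleles[int(i)] for i in gt.split(sep) if i != "."]
        samples.append({"GT": gt, "DP": values.get("DP", "."), "alleles": called})
    return chrom, pos, alleles, samples

# Three samples: homozygous reference, heterozygous, and heterozygous with two ALT alleles (1|2).
record = "1\t861276\t.\tA\tT,C\t258.6\tPASS\tDP=30\tGT:DP\t0/0:10\t0/1:12\t1|2:8\n"
print(parse_vcf_record(record))
```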
It's complicated. It depends on each variant caller, because they don't compute the genotype quality the same way. Most of the time, I would say around this: a genotype quality above 40 starts to be a good one. Sometimes, when you really want high quality calls — for example when you want to do some imputation, when you want to keep only the calls you really trust — we go to 60.

Yes? Is it the same approach for calling a structural variant? What do you mean? Like, what you explained is more for small variants; when you are calling larger variants, what you call structural variants, is it the same approach? For the variant calling? No, it's a completely different approach, and a completely different method, and we'll see that later this morning.

So, now that we have done our variant calling, what is interesting is to filter our data and do some annotation. When you want to filter your data, there are two main approaches. The first one is what we call hard filtering: going and manually filtering your data. There is a tool in GATK for that, VariantFiltration, or SnpSift, which is one of the pieces of software included in the SnpEff tools. What you do in that case is define some criteria and use them to remove all the variants that don't pass those criteria. For example — we talked about quality — you say, OK, I want to keep all my variants with a quality score of 60 in a region where I have at least, for example, 20x of coverage, and so on. It works really well and it's really efficient, but it's difficult, because you need to really understand the algorithm you used to do the calling, and all its different parameters, to have real expertise on each criterion and how it affects your data. It will require a lot of benchmarking to find the best set of criteria.

The other method, which is more of a machine learning approach, is to learn some rules based on what you have in your data and then use that to do the filtration: this is what we call variant recalibration. This method works really well, but it has one main disadvantage, which is that you need to apply it to a lot of samples. I would say there is no fixed number, because it depends on how many samples there are and how many variants per sample, but usually, when you have fewer than 20 samples, it probably won't work: the algorithm will fail because you don't have enough information to do the learning. So today we won't do that; we will do the first one. But in general, when we have a large cohort, we use the other one. So how does it work? It works with a machine-learning-based approach. It takes databases like Omni, 1000 Genomes, HapMap, to extract a core set of variants: the set of variants in your samples that are also part of these different consortia, and so are very likely to be true positive variants. It selects these true positive variants and learns rules about them. The different parameters could be depth, variant quality, strand information — all the parameter information provided by the caller. Then it applies the rules to all the variants in your data, in all your samples, and you tell it the sensitivity you want — for example 99% or 99.9% — and it will apply the rules to reach the sensitivity you asked for and select the variants that fit within that sensitivity.
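Just to illustrate the idea — and only the idea; GATK's actual VQSR is more sophisticated, training positive and negative models and reporting a VQSLOD score — here is a deliberately simplified sketch: learn a model of the annotations of trusted variants, score everything, and keep whatever reaches the target sensitivity. The annotation matrix and truth mask below are made up just so the sketch runs:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def recalibrate(annotations, is_truth, target_sensitivity=0.99):
    """Score every variant under a model learned from trusted variants, then threshold."""
    model = GaussianMixture(n_components=4, random_state=0).fit(annotations[is_truth])
    scores = model.score_samples(annotations)              # log-likelihood under the "good" model
    cutoff = np.quantile(scores[is_truth], 1.0 - target_sensitivity)
    return scores >= cutoff                                 # keep ~99% of truth variants, apply to all

# Fake data: 4 annotations per variant (think depth, quality, strand metrics) for 1000 variants,
# of which ~30% were also found in the truth resources (HapMap, Omni, 1000 Genomes).
rng = np.random.default_rng(0)
annotations = rng.normal(size=(1000, 4))
is_truth = rng.random(1000) < 0.3
print(recalibrate(annotations, is_truth).sum(), "variants pass")
```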
How many VCFs would you start with? I mean samples, because I will have one VCF — but for that, I would start with between 20 and 30 samples. The more, the better. And is there specific information you're extracting from the VCFs — what are the features? That's one of the main disadvantages: you don't know exactly which features have been used. I think you can dig a bit, but you don't get a clear set of features that you could use afterwards to redo it manually. If you're extracting variants from these databases, isn't the algorithm going to expect your data to have certain information in order to run the machine learning? What sort of data does the algorithm expect to have in order to run? It's the VCF from your samples, with the different information you have in your VCF, and the VCFs of the other databases — so the known variants from those databases. That's all you need.

So, we talked about databases — HapMap, dbSNP and so on — and this is what we can use to try to understand our variants. HapMap is a big project that provides a lot of information. There is also the 100,000 Genomes project in the UK that is ongoing, providing these high quality variants that help you screen for false positives. And you also have the dbSNP database, which is really good data — it's at version 148 now, I think — and you have millions and millions of SNPs. So here again it's mostly for screening false positives.

When you have filtered your variants using one of these approaches, what you want to do is look at the annotation of your data. What we often use is a mappability flag. We use our own mappability tracks that we developed, but there are mappability tracks available from Ensembl and other consortia. The idea is to flag regions of the genome where you know there is a mappability issue: either there are fewer reads than expected, because the region is more difficult to map, or an excess of reads, because a lot of repeats can pile up in the region and probably produce false positives. It's just a little flag to tag some variants and say, OK, this variant is in a region I don't really trust, so I won't put all my money on that variant. Then the dbSNP database helps you know whether the variant you are looking at is already known and has already been reported by other groups. So if you are looking for a variant for a rare disease, but you see that this is a common, known variant, you will say, OK, this is not the one I'm interested in.

Now for the really interesting annotation we usually use: trying to understand what the effect of my variant is. There are many tools to do that; the one we use, and will use today, is SnpEff. The idea is that you take all the transcripts, and if a variant covers a transcript, you look at the position in the transcript, the frame and everything, and you are able to predict the effect of the variant on the gene. Will it change it? Will it create a stop codon? Will it just be a synonymous change? And so on.
So, if your variant creates a stop codon in a protein, the impact of that variant is far more important than if it just makes a synonymous change, where you say, OK, there is no real direct impact — you could still have an impact on the 3D structure or something else, but not such a major one. dbNSFP is there for functional annotation of the change, so it's a bit different; it's more at the level of pathways and protein structure. And if you work in cancer, there is the COSMIC database you can use to see whether the variant has already been reported as a somatic mutation leading to a specific cancer.

What's your cut-off for saying that this is a variant? Let's say you have 100 reads that cover a base, and only 10 of them carry it. In germline, it would probably not be called. If I have only 10% of my reads, they would need really, really good quality and perfect matching, and otherwise it would probably be discarded as a false positive. Because in the normal case, if you don't have a mappability issue or a duplication or whatever, you expect 50% of your reads on the variant if it is heterozygous. So you expect a real variant to have an allele fraction like that, around 0.5 or 1, and something down here could still be real, depending on the quality of your reads and so on, but it's hard to trust if you are doing normal germline variant calling. If you are in a tumour, it's completely different, because, as we said, there is purity, clonality, and heterogeneity to take into account, and that's why there are many dedicated methods.

So today we'll do annotation with SnpEff. As I explained, it annotates based on the reference genome and the different transcripts; it looks at coding and non-coding variants, for example in transcription factor binding sites, and it does a basic prioritization of your variants, telling you whether the impact is HIGH, MODERATE, LOW, or MODIFIER. It also provides some metrics, which is always good to have. As I said, calling the variants usually takes a couple of hours; and when we do annotation, running the annotation tool to get your first results is really easy, but having a real annotation — somebody who knows the biology behind the project looking at the data, taking the variants, looking at the genes that are impacted — can take days, months, whatever. So that's the hardest part, which we won't do today, but there are other workshops where you can learn more about building that expert judgment — there's one in Montreal in mid-June, I think it's the genomic medicine workshop. You already saw this yesterday, but just so you know, a VCF can be opened in IGV, so it's easy to see the information when you look at it. Usually, in the top panel of the VCF track in IGV, you have the frequency of the variant among your samples, and then you have a description for each sample. And my general message: metrics, metrics, metrics, metrics — I need to update this next time to put some MultiQC instead of this old-fashioned report. And just so you know, what is cool with SnpEff is that it also provides a lot of statistics on the variants, and we will see that during the practical session. OK, any questions?
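Since SnpEff writes its predictions into the INFO column as an ANN entry (one comma-separated annotation per affected transcript, with pipe-separated subfields), a short sketch of pulling out the predicted impact per variant could look like this — the field positions follow the published ANN specification, but treat the details as approximate:

```python
def ann_impacts(info_field):
    """Extract (gene, effect, impact) tuples from a SnpEff ANN INFO entry."""
    info = dict(kv.split("=", 1) for kv in info_field.split(";") if "=" in kv)
    results = []
    for ann in info.get("ANN", "").split(","):
        parts = ann.split("|")
        if len(parts) > 3:                       # Allele | Annotation | Impact | Gene_Name | ...
            results.append((parts[3], parts[1], parts[2]))
    return results

# Keep only the variants SnpEff flags as HIGH or MODERATE impact for expert review.
info = "DP=84;ANN=T|stop_gained|HIGH|TP53|ENSG00000141510|transcript|ENST00000269305|protein_coding"
print([a for a in ann_impacts(info) if a[2] in ("HIGH", "MODERATE")])   # [('TP53', 'stop_gained', 'HIGH')]
```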
Yeah — no, because this workshop is just about how to do the sequencing analysis. There's a workshop on genomics here, and I also teach a full week on cancer genomics in the UK. I can talk about it a bit, but it's a totally different world, which is why we can't cover it here — otherwise it would take days and days. But do you have a question on cancer genomics?

Yeah, exactly. The problem with cancer is, first, that you are not trying to look at the same information. Here we talked about germline variants; in cancer, sometimes you may be interested in germline variants or LOH variants, but a lot of people are more interested in what are called somatic variants, the variants found only in the cancer cells and not in the germline. So first, it's more difficult, because we work with a paired approach, where we usually have a blood sample and a tumour sample, and what we want is to find what is different in the tumour sample compared to the blood sample. That's the classical case — because if you work with, for example, leukemia, which is a blood cancer, you can't use blood, so there are more complications again. So we still have this paired approach, but it makes us use different tools and different methodologies for the variant calling. Then we have the cancer-specific challenges: when you take data from cancer cells, many of the cells in your sample are dead or damaged, and your sample is always a mix of normal cells and tumour cells. So the real fraction of cancer cells in your sample is a percentage: it could be 90%, which is good, but sometimes only 30% of your data really comes from cancer cells. Imagine you have 30%; then you have all the heterogeneity of the cancer, and you have the clonality — not all the cells in the cancer will show the same variant. So if you have a clone which represents 50% of your cancer cells, in a sample where only 30% of the cells are cancer cells, you are only looking at 15% of your reads that could represent that clone, and if the variant is heterozygous, that's about 7% of your reads. You can imagine how complicated it becomes for the variant caller — all the variant callers we mentioned are based on a model where we expect 50% of the reads to carry the variant, so they fail on that. It's a totally different kind of work when we work in cancer.

Yes — just to go back to the four strategies to improve your data before you do the variant calling: realignment, yes. It's because, as I said, if you have a real indel in your data, it's likely that not all the reads containing the indel will be mapped with the indel. Alignment is a penalty model, and most of the time, including substitutions costs less than including an indel, especially an indel of several bases. In those cases, some reads will not include the indel and will instead have several substitutions, several mismatched bases, close to the indel. So the idea is to go where you think there is an indel, play with the reads, and see: if I include my indel, does it create a more harmonized, higher quality region?

Base quality recalibration is correcting, for each base in the sequence, its quality value, to give it the correct number it should have, depending on its position in the read and the genomic context around it — because the base quality is used in the Bayesian algorithm to estimate the probability of a variant. So I cannot show you a cartoon of how it works, because it acts down at the level of the individual bases, but OK, yeah.
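Coming back to the tumour example above, the arithmetic behind that 7% figure is simply the product of the dilution factors (toy numbers from the example):

```python
# Expected fraction of reads supporting a heterozygous somatic variant in the example above.
tumour_purity   = 0.30   # only 30% of the cells in the sample are cancer cells
clone_fraction  = 0.50   # the clone carrying the variant is 50% of those cancer cells
het_allele_frac = 0.50   # heterozygous: one of the two copies carries the variant
print(tumour_purity * clone_fraction * het_allele_frac)   # -> 0.075, i.e. ~7% of the reads
```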
Yes, sorry — we see the data and their quality here on this side, for these three columns, the 0/0, 1/2 and so on? Yes, these are the three samples. So the ninth column, FORMAT, is a description, based on the header, of what information will be displayed for that variant in every sample; then each sample column displays its information in the same order as what is listed in FORMAT. Is the fourth one a variant? The fourth one, I think, is not a variant. Probably, as I said, the multi-sample approach first looks across the genome to find positions where there is possibly a variant. At this position, a few reads in some samples looked like possible evidence of a variant, so it was kept as a position to examine. But at the end, when they do the maximum likelihood, they say, no, this sample doesn't really seem to have anything, so probably it's not a variant there. That's why it's shown like that.

I have to ask you another question about the multi-sample approach. Let's say you are analyzing exomes, 36 exomes, and a couple of the samples come out with lower quality — would you still include them? Yes — you will probably end up with lower quality for those samples; their quality will be lower because of the lower quality of the data. But as long as you used the same exome kit and the same sequencing approach, they just carry less information. So taking information from the others is the best way to maximize what you can get from that data. If you separate them, with so little information you will really be at a lower level; kept together, you will probably still be able to get some calls, so you would still do multi-sample calling. But let's say a few of the samples had really low coverage — 90 of them were really good and 6 of them were really low — should you include the 6? Yes, because, as I said, it's a standardized sequencing experiment; they just lack information. So try to get that information from the others: use the other data, with plenty of coverage, to define where the good sites are, and then, when you go back to the maximum likelihood, you will get what you can based on the coverage you've got. Maybe you will have a lot of false negatives, probably, but you will also probably get more true positives than if you separated the 6 and analyzed them on their own. Afterwards, when you do downstream analysis with your variants, perhaps it will be time to think: should I include this sample, which has a lower quality, in, for example, my case–control analysis? Because they are not at the same quality. But for variant calling, I would definitely include them. These 6 samples, as I said, are not different at this stage; they will not bring information that misleads the others, because it's the same experiment, the same kit, everything. They cannot bring the full potential of information, but for the variant calling it won't hurt. And, as I said, for subsequent analyses, be cautious and think about what you are doing — but for variant calling I would definitely include them.

Yes — this is again about the multi-sample approach. Usually we send 10 or 20 samples at a time to get sequenced. Are you saying that, when it comes to cases of a disease or phenotype, we would be better off pooling all of them and running everything again, or is it due to batch effects? Is the assumption that the underlying disease is homogeneous? So, as I said, if you have standardized how your samples have been
generated at the beginning — the same library kit, the same exome kit if you are doing exomes, the same instrument — and if you see that your runs behave the same way and everything is OK, I would include them. With HaplotypeCaller, what I would do is this: you usually produce individual GVCFs, which collect the information for each sample without calling anything final; then you merge your samples into batches of GVCFs; and then, when you merge your batches, you do the variant calling and genotyping all together. Yes — because if you don't do that, the problem, if you have a large cohort (and it's a problem for everybody in terms of resources), is the following. I was part of a consortium where we analyzed a thousand Alzheimer's samples, cases and controls, whole genomes. We were working by batches and doing multi-sample calling by batches of hundreds of samples, and then, for a storage issue, we started to remove the BAMs or put them on tape. In the last batch, we started to find some variants that had not been included in the other batches. In that case it's problematic, because you cannot easily go back — you can, but it takes a lot of time and a lot of resources — to say, OK, the variant I found here, why didn't I find it in the other batches? Because I don't have data? Because I don't have evidence? Because of whatever? So it's always better to have a final genotyping where you take everybody together, so that when you look at one variant, you get the real information from all the other samples in your analysis, and you can say: if I don't see a variant there, it's either because I don't have reads covering this region or because I don't have evidence of the variant.

Is it largely to deal with batch effects? Sorry? Is it largely to deal with batch effects? Usually, if your runs work well, you don't have such a batch effect for variant calling. That said, we do observe batch effects sometimes — for example because we had a failure of the HVAC in the building during a run; yes, that would be something to take into account. But if you are in a sequencing centre that runs in production mode — same instruments, same libraries, probably the same technicians — it should give quite similar results. After that, you still keep the batch information somewhere, so if you start to see some variants that pop up specifically in one batch, I would say perhaps that's a batch effect. But by default I would not put too much weight on it. Another thing: we talked about variants, annotation, filtration, but as Hamza said yesterday, this is just prediction. Whatever you do, for the interesting variants you will go and use an orthogonal technology to validate them, otherwise people won't accept your variant. It's just a matter of prediction: this is a super interesting variant, so you do some Sanger sequencing on your variant, and you get the trace, and you see the coverage or whatever. That was also...
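For reference, the GVCF batching workflow described above looks roughly like this with GATK4-style commands (a sketch only — the paths are hypothetical, and the exact tool names and flags should be checked against the GATK version you actually use, since the workshop pipelines may differ):

```python
import subprocess

REF = "reference.fasta"                  # hypothetical paths, for illustration only
samples = ["sampleA", "sampleB", "sampleC"]

# 1) One GVCF per sample: collect the evidence without doing the final genotyping.
for s in samples:
    subprocess.run(["gatk", "HaplotypeCaller", "-R", REF, "-I", f"{s}.bam",
                    "-O", f"{s}.g.vcf.gz", "-ERC", "GVCF"], check=True)

# 2) Merge the per-sample GVCFs (per batch, for large cohorts).
combine = ["gatk", "CombineGVCFs", "-R", REF, "-O", "batch1.g.vcf.gz"]
for s in samples:
    combine += ["-V", f"{s}.g.vcf.gz"]
subprocess.run(combine, check=True)

# 3) Joint genotyping across everybody at the end.
subprocess.run(["gatk", "GenotypeGVCFs", "-R", REF, "-V", "batch1.g.vcf.gz",
                "-O", "cohort.vcf.gz"], check=True)
```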