So, my name is Guillaume Bourque. As I mentioned this morning, I work at McGill at the Genome Center, and I also lead the Canadian Centre for Computational Genomics, which is a platform that does bioinformatics analysis and service as part of Genome Canada. As Anne mentioned this morning, this is a new workshop that we set up with Mike, starting with a basic introduction to the types of problems where variant calling in particular can be useful. Mathieu's module and mine are a bit more technical, going through the steps of variant calling and variant annotation, and you'll see that at the end of module three, Mike will come back and, once you've generated these variants, link back to his introduction on how you can actually use them to interpret disease and phenotype. So hopefully that overall flow will make sense. My module follows on Mathieu's: Mathieu talked about mapping and variant calling, and I'll be talking about variant annotation. Going back to the overview of what the data processing for calling variants looks like: Mathieu covered the initial data processing steps, and in module three we'll go over variant discovery and variant annotation. The specific objectives of this module, and I guess that's the theme for today, are the identification of disease-causing mutations and understanding the limits and challenges involved. I've included some slides on whether you should be doing exomes or whole genomes, and I'll cover that. Then I'll go into the more technical part: the VCF files that hold the variants, and how you actually filter and prioritize the variants you get. First of all, the overall objective here is to identify disease-causing mutations, and Mike touched on that a little, but the terminology is important here. You've got pathogenic mutations, which are what we're trying to identify, that contribute mechanistically to the disease; as Mike mentioned, they're not necessarily fully penetrant, but identifying these is the objective. You can have variants that are implicated in the disease, meaning there's evidence consistent with a pathogenic role at a certain level of confidence. We've talked about associated variants, which are significantly enriched in cases. And then there's this whole range: damaging mutations that would alter normal levels or function of the gene product, and finally deleterious mutations. So you really have these different types of mutations that you might be looking for. How do you actually identify variants that are implicated in human disease? The paper I reference at the bottom is, I think, quite interesting; this is just a subset of its recommendations for associating variants with disease. There are some general guidelines: for any observation, you have to compute the probability of observing it by chance. This is related in part to GWAS: if you're doing multiple hypothesis tests, you really have to correct for the probability of observing an association by chance.
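As a concrete aside (my own illustration, not from the slides): with on the order of a million independent tests, correcting for multiple testing is exactly what produces the familiar genome-wide significance cutoff. A minimal sketch in Python:

```python
# Minimal illustration of multiple-testing correction when scanning
# many variants: with ~1e6 independent tests, a nominal p < 0.05
# would yield ~50,000 false positives by chance alone.
n_tests = 1_000_000          # e.g. variants tested in a GWAS-style scan
alpha = 0.05                 # desired family-wise error rate

bonferroni_threshold = alpha / n_tests
print(f"Per-test significance threshold: {bonferroni_threshold:.1e}")
# -> 5.0e-08, the familiar genome-wide significance cutoff

# Expected false positives if you (wrongly) used the nominal alpha:
print(f"Expected chance hits at p<0.05: {n_tests * alpha:.0f}")
```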
An important guideline, and I'll come back to this, is taking advantage of public data sets. Knowing whether a variant is rare or common in the population is a very important aspect of implicating variants in human disease. There are a number of guidelines; I've highlighted a few of the key ones. What's the evidence for that gene being implicated? Is this a new report? Only report a new gene when you can actually see, and Mike had a slide on that, the same variant in multiple individuals, comparing the distribution against, ideally, a matched control cohort. Then report the evidence for that variant being pathogenic. Recognize that strong evidence that a variant is deleterious is not necessarily enough to implicate it in a causal role; ideally, you also want orthogonal experimental validation confirming that the variant has an actual damaging impact. And finally, it's important to highlight the actionable findings, while also reporting the additional findings at the same time. So this is just a snapshot, but I really recommend looking at that particular paper for these guidelines. Moving on to the next topic, which is identifying and understanding some of the limits and challenges when you're trying to do this variant calling and annotation. Here's another recent and, I think, great review, and I've highlighted some key components. There's a big difference, and we talked about this a little, between applying these technologies in a research context versus a clinical context. There's great potential to apply them for diagnosis in the clinic, but there need to be some adjustments in how these tools are applied once it's done in a clinical setting, and Carl, in the module at the end of the day, will come back to that point in terms of regulations and things like that. But there are also some basic principles, and the quote I'm highlighting here is quite interesting: most of the algorithms we're going to use were developed for discovery. If you're missing a variant in exome sequencing, at some level you're missing an opportunity for a discovery; that's unfortunate, but it's not necessarily the end of the world. Once you're applying these methods in a clinical setting, making an inaccurate diagnosis because you're not picking up a variant is a whole other ballgame. Related to that, once we start talking about applying these sequencing technologies in the clinic, one of the first questions, both in research and in the clinical setting, is whether to do exome or whole genome sequencing to detect variants. This review, I think, does a great job of highlighting some of the differences and their implications. At the top of the figure, for a coding region, you see the coverage you get from whole genome sequencing, from a standard exome capture, and from an augmented or targeted exome capture kit, some of the newer generations of capture.
So you see that, first of all, you've got irregular coverage even from whole genome sequencing; some parts are not covered as deeply as others. In some cases, with some exome capture technologies, you might have even bigger gaps. That's in the coding region. Once you move to a genomic region that includes coding and non-coding sequence, you see again that whole genome coverage is not completely uniform: you still have some regions with relatively low coverage, but at least you're covering most of the genome. That contrasts quite a bit with the exome, which, as expected, covers only the specific coding regions of the genome. So what does that translate to, especially in the clinical context? These are the 56 most actionable genes from the American College of Medical Genetics guidelines; if variants are detected in these 56 genes, they should systematically be reported. What these plots show is that some of the coding bases of these genes are not necessarily well covered. If you look, for instance, at this exome assay: they measured the number of bases in these 56 genes that are not covered at a certain coverage level and a certain quality, and you see quite a lot of variability. This shows that there are blind spots, even in these very important genes, where you won't be able to call any variants using that particular platform. That's also true with whole genome sequencing, although it does a little better, and in some ways the more targeted exome kits in this case may have slightly better coverage. It's useful to be aware that whole genome sequencing is great for discovery, but it might not be the most appropriate choice depending on the types of questions you're asking. That's something important to keep in mind as you choose which platform to use, really thinking about the types of questions you're going to have. So that's for the actual coverage.
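As a practical aside, you can check for these blind spots in your own data if you have per-base depth, for example from `samtools depth -a` (the `-a` flag makes it emit zero-coverage positions too). A minimal sketch, with hypothetical file and region names:

```python
# Count bases below a minimum depth inside regions of interest
# (e.g. ACMG gene exons), given per-base depth as produced by
# `samtools depth -a` (columns: chrom, 1-based position, depth).
MIN_DEPTH = 20  # a common threshold for confident variant calling

# Hypothetical BED-like regions: (chrom, 0-based start, end, gene)
regions = [("chr17", 41196311, 41197819, "BRCA1")]

low_cov = {gene: 0 for _, _, _, gene in regions}
with open("sample.depth.txt") as fh:        # hypothetical file name
    for line in fh:
        chrom, pos, depth = line.split()
        pos, depth = int(pos), int(depth)
        for c, start, end, gene in regions:
            if chrom == c and start < pos <= end and depth < MIN_DEPTH:
                low_cov[gene] += 1

for gene, n in low_cov.items():
    print(f"{gene}: {n} bases below {MIN_DEPTH}x")
```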
There are lots of other challenges associated with using these next generation sequencing technologies for variant calling. Mathieu touched on this a bit, but again, there are blind spots because these are short reads, which affects whether you're able to map them uniquely on the genome. There are also issues with the reference genome you're mapping to, and this one, I think, is quite interesting: the reference we use to map the reads is itself a genome. It's not quite the genome of one individual, but it still contains variants that are actually rare, and in some cases might even be disease variants. If the reference genome contains a disease variant, you're not going to call that variant in your dataset, because your data will look just like the reference at that location; so you might miss some variants simply because they were already present in the reference genome. Then there are long repeats and highly polymorphic regions, where short reads are not great at mapping and where you're also going to make some mistakes. There are more. I'll come back to this one in more detail: variant calling for single nucleotide variants is by now pretty accurate, and we have good pipelines and good algorithms to do it, but as you go towards other types of structural variants, that can sometimes be more challenging. There are additional challenges that are at some level a bit more technical, like the way variants are encoded in VCF files, and we'll get back to that: sometimes there's ambiguity in how a variant is represented, which can lead to mistakes as well. But the key thing I wanted to expand on here is these structural variants and other types of variants. In the module and the practical we're going to do, we're really going to focus on detecting point mutations. But especially with whole genome sequence data, you have, in theory, the ability to detect a whole range of variants: short indels, copy number alterations such as deletions or duplications, translocations, foreign DNA and so on. As I said, the focus of the pipeline we'll present today is really single nucleotide variants, which tend to be the calls where the benchmarking has been done and where we have good confidence. As you go towards some of these other types of structural variants, or regions of low complexity, there's clearly lower accuracy in calling those types of events. I was just discussing over lunch what happens if you've done this single nucleotide variant calling and you don't find any good hit: it might be that one of these other types of events is actually associated with the disease. I think it makes sense to start with single nucleotide variants, because that's where we have good accuracy, but it might also be worth applying some of these other tools to the data to pick up these other variants as well. Related to that, another challenge, and now we're getting closer to variant annotation. This is a project I was part of, where we did the sequencing at the genome center: sequencing 100 kidney tumors. Some of you working in cancer might already be aware of this. Every square here represents a thousand mutations that we found, so in total we found more than half a million mutations in this data. The challenge of identifying, within that giant list, which mutations may be associated with the disease can be quite daunting. This is from whole genome sequence data, and the obvious thing, of course, is to focus on the coding mutations, which are a very small subset of all the mutations we detect. It's funny: you use whole genome sequencing, and in the end most of what you use is really a very small subset of the mutations you detect.
Those are the ones we know, at least a little, how to annotate and how to predict the potential impact of, and that's what we'll do today as well. But there are potentially damaging mutations in the non-coding regions of the genome too, and that's something we'll talk about tomorrow. That's another ongoing area of research: how to annotate, prioritize and filter not just the coding mutations, but also the non-coding mutations we detect. Okay, so that was my introduction in terms of the objective, what we want to do, and some of the challenges. Now we'll continue with what Mathieu started to present and carry on with the workflow of annotating variants. Again, the overview of the workflow: you do some data processing, and, assuming the cluster is working, that's what we're going to do: map the data, call the variants, and then annotate the variants. Before I talk about variant annotation, I want to add a little to what Mathieu covered before lunch, still on variant calling. You map the reads, and then there are a number of things you can do to improve the variant calling: local realignment, removing duplicates, recalibrating the quality scores. Another interesting way to improve variant calling is to use family or population structure. This is especially important if you don't have extremely high coverage: instead of calling the variants separately in each sample, you might want to call variants using the information you have about the pedigree or the population. Here's an overview of this. On the right side you have the basic mapping, realignment and so on, and then you do single-sample calling, one sample at a time, which produces a list of variants. But depending on whether you have a cohort, or again families, you can also call the variants with that information in mind and do multi-sample calling. In some cases, especially if you have low sequencing coverage, that can actually improve the quality of the calls quite a bit. And to give you an idea of how that works, suppose there is a limited number of haplotypes in a given region in your population, say a blue haplotype and a red haplotype. If you then observe reads from an individual carrying these letters, you can probably guess what the end of the read is without really looking at it, and if you see an error in a read, you'll be able to detect it more easily. So you can use information about haplotypes in the population, and about other samples, to correct calls or simply do a better job. Again, I won't go into the details; this was just to give you an idea that, in the context of variant calling, it is sometimes useful to use information from the population or from other samples.
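To make that haplotype intuition concrete, here is a toy sketch (my simplification; real callers and imputation tools use proper probabilistic models) that matches a read against known population haplotypes and flags positions where it disagrees with the best match as likely errors:

```python
# Toy haplotype-aware genotyping: given known population haplotypes,
# score a read against each and pick the closest. Positions where the
# read disagrees with the best haplotype are likely sequencing errors.
known_haplotypes = {
    "blue": "ACGTAACG",
    "red":  "ACCTTACG",
}

def best_haplotype(read: str) -> tuple[str, list[int]]:
    """Return the closest haplotype name and the mismatch positions."""
    def mismatches(hap):
        return [i for i, (r, h) in enumerate(zip(read, hap)) if r != h]
    name = min(known_haplotypes,
               key=lambda n: len(mismatches(known_haplotypes[n])))
    return name, mismatches(known_haplotypes[name])

# A read one base away from "blue": probably blue plus one error.
name, errs = best_haplotype("ACGTAACC")
print(name, errs)   # -> blue [7]: likely a sequencing error at position 7
```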
Coming back to the slides: this plot, from the initial GATK papers, shows how much this can improve genotype accuracy, especially in the low-frequency range here, for rare variants. But again, that was just an aside. [Question: you said one of the ways you improve accuracy is by getting rid of duplicates, but do duplicate reads really change the call, or do they just throw off the BAM? Is it just to get cleaner data?] Well, suppose you have an error and then that error is duplicated: you're going to think you saw it five times, and you're going to weight it too strongly. If these are all exactly the same read, that's where it makes a difference. [How do you tell the difference between a duplicate and two independent reads?] It's usually based on reads having exactly the same start and end. Reads like that get marked as duplicates, and then they only get counted once in the calculation, because you're weighting the observations; the duplicated read just gets a weight of one. And typically, if you have relatively high coverage, you won't trust a single read claiming there's a change at a position. So, and this relates well to Mike's presentation, taking information about the pedigree into the calling and then looking at the segregation of these variants actually helps you quite a bit in identifying good candidates: if there's a very strong phenotype in the kid and the variant is also found in a parent, it's unlikely that that's really the cause of the disease. Okay, so that was my add-on to module two on variant calling.
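As a rough sketch of the duplicate-marking logic just described (my simplification of what tools like Picard MarkDuplicates do, keying only on alignment coordinates and keeping the best-quality copy):

```python
# Toy duplicate marking: reads mapping to exactly the same start/end
# (same fragment coordinates) are treated as PCR duplicates, and only
# one of them counts as evidence for a variant.
from collections import defaultdict

# (chrom, start, end, mean_base_quality) - hypothetical aligned reads
reads = [
    ("chr1", 1000, 1150, 35.1),
    ("chr1", 1000, 1150, 33.8),   # duplicate of the read above
    ("chr1", 1000, 1150, 34.5),   # another duplicate
    ("chr1", 1003, 1153, 36.0),   # different coordinates: independent
]

groups = defaultdict(list)
for read in reads:
    groups[read[:3]].append(read)          # group by (chrom, start, end)

# Keep the highest-quality read from each coordinate group.
kept = [max(group, key=lambda r: r[3]) for group in groups.values()]
print(f"{len(reads)} reads -> {len(kept)} after duplicate marking")
```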
Now, in terms of the types and sizes of the files you're dealing with: in the practical, Mathieu was kind enough not to start with the full sequence file, which would have been very big. You then apply tools to actually call the variants, and that's what we'll do in the practical; we'll be using GATK, though there are alternative callers as well. So you go from a very big file containing all of the mapped reads to something much more manageable, the VCF file, which actually has the calls. These are per patient; this is whole genome sequencing again, so that would be the size per patient, and if you have a big cohort of patients this adds up to quite a bit. That's why learning how to use clusters like Compute Canada is useful. By the time you get to the VCF, these are files you can more easily load into web services, a bit like what you'll do with Mike at the end. So what do these variant files look like? Mathieu talked about the FASTQ and BAM files; VCF stands for variant call format, which makes a lot of sense. You've got header lines, which provide information on what was run, which reference files were used, and the various parameters of the variant caller. Then the key information is embedded in the rows, each row corresponding, at this point, to a variant: the chromosome, the position, the ID, the reference base (what is seen in the reference genome at that position), what was actually observed, and a quality score associated no longer with a read but with the call itself; the number of reads that support the variant, and so on, all get converted into a quality score for the variant call. Then there are additional INFO fields containing information about the reads supporting the call. We'll go over that, and you'll go over it in the practical with Mathieu some more. So that's the basic file, and it's at least more manageable in size: these are all the positions that differ in your genome relative to the reference genome. But in a given project you potentially have a very big file with a lot of variants, so the next step is really to filter and annotate it. How do you filter the variants? You can do some manual filtering based on how many reads support a call, what the score is, and things like that, but if you haven't done this before, it's a bit challenging to know exactly where to draw the line: if you have a quality score of three here, is this a variant you should pay attention to? How do you know which scores and which variants to filter on? There are nice tools, and I'll show you how this works, that use the data itself plus known calls to produce a better rank ordering based on the likelihood of a call being real. We won't be doing that part today, but it's good to know it can be done. What you do is take data sets that are good proxies for true and false positives. There is high quality data, like the HapMap project, that reports known variants in the population; you can use that to check whether you're missing variants in your calls, because you don't want to miss too many of what are known to be real variants. Then you can use other data sets; it's a bit scary to use dbSNP in this context, because it's known that quite a few of the variants included in dbSNP are mistakes, so you can include part of it as a set of potential false positives. In the GATK package, they've shown that you can use these sets of good calls and not-so-good calls to train the parameters and better rank the variants. I won't go into the details of how that works, but you can use this feature to recalibrate the variant scores using known good variants and known false positives; this recalibration is one of the features included in the framework being used for calling variants. So I'm coming towards the end of what I wanted to present before we can get into the practical of actually doing all of this.
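Since we won't run the score recalibration today, here is what the simpler manual hard filtering looks like in practice: a minimal sketch that parses the tab-separated VCF body and keeps calls above illustrative QUAL and depth thresholds (the file name and cutoffs are hypothetical, not recommendations):

```python
# Minimal VCF hard filter: skip '#' header lines, then keep variants
# whose QUAL and read depth (DP, from the INFO column) pass thresholds.
MIN_QUAL = 30.0
MIN_DP = 10

def info_field(info: str, key: str):
    """Extract one key=value entry from a semicolon-separated INFO string."""
    for entry in info.split(";"):
        if entry.startswith(key + "="):
            return entry.split("=", 1)[1]
    return None

with open("sample.vcf") as vcf:          # hypothetical file name
    for line in vcf:
        if line.startswith("#"):
            continue                     # header / column-name lines
        chrom, pos, vid, ref, alt, qual, flt, info = \
            line.rstrip("\n").split("\t")[:8]
        dp = info_field(info, "DP")
        if qual != "." and float(qual) >= MIN_QUAL \
                and dp is not None and int(dp) >= MIN_DP:
            print(chrom, pos, ref, alt, qual, sep="\t")
```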
So now, variant annotation and prioritization. So far, all we've done is map the reads, call the variants, clean them up, and rank them by how likely they are to be real, based on various properties such as the number of supporting reads. Now the other key step is really to annotate and prioritize them. I've touched on some of these annotations already. Quality and confidence scores are obviously criteria you want to look at: what was the quality of the call, and are the reads in low-confidence regions? Those are annotations about quality and confidence. Another important thing, and again I talked about this a little, is to report whether the particular variant is found in dbSNP, whether it has been observed before, because if it's in 20% of the population, it's unlikely to be the rare de novo mutation you're looking for. I'll come back to these databases of previously observed variants, but the annotation of whether a variant has been observed before is quite critical. That's mainly at the population level; then you've got disease-specific databases that you might be able to use to know whether a particular variant has been associated with, or observed in, a clinical setting. That's ClinVar. Again, I'll go over these databases a little; I should say these lists are not comprehensive, but they give you examples of some of the important annotations you want to add to your variants. The last category of annotation is the variant effect predictors, and Mike touched on that this morning; especially for coding mutations, there are lots and lots of tools that can predict the impact of a variant on the gene, whether it's likely damaging and so on. For the most part these are putative effects, but it's still quite informative whether a variant introduces a stop codon versus a splice change versus a synonymous change, and so on. As I said, if you're doing whole genome sequencing, the majority of variants will be non-coding; the annotation of effects there is less well defined, but we'll talk about that part a little more tomorrow. Going back here: I've talked, hopefully at some level, about the confidence scores, so I'll now say a bit more about the databases that can be used. Separate from 1000 Genomes and dbSNP, there is one resource that was initiated as the ExAC database and has now become the Genome Aggregation Database, or gnomAD. This is a great database where people sequencing as part of various projects submit their VCFs from exomes and whole genomes to be aggregated. As of right now, it contains variant calls from over 100,000 exomes and over 15,000 genomes. It's a great resource because, if you're sequencing a new patient and you want to know whether a variant has been observed before as a way of prioritizing, you can check it there. You choose your own frequency threshold that you're comfortable with, and maybe you don't set it to zero, but if you expect something that shouldn't be found too frequently, because it's specific to that disease or that patient, you might set a frequency filter against this database.
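As a sketch of the frequency filter just described, assuming a prior annotation step has written the population allele frequency into the INFO column under a hypothetical GNOMAD_AF tag:

```python
# Toy rare-variant filter: drop variants whose population allele
# frequency exceeds a chosen cutoff. The frequency is assumed to have
# been annotated into INFO under a hypothetical "GNOMAD_AF" tag.
MAX_POP_AF = 0.001   # e.g. keep variants seen in <0.1% of the population

def pop_af(info: str) -> float:
    for entry in info.split(";"):
        if entry.startswith("GNOMAD_AF="):
            return float(entry.split("=", 1)[1])
    return 0.0       # absent from the database: treat as possibly novel

with open("annotated.vcf") as vcf:       # hypothetical file name
    for line in vcf:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if pop_af(fields[7]) <= MAX_POP_AF:
            print("\t".join(fields[:5]))
```

Note, as comes up in the discussion next, that absence from the database is not proof of novelty: the region may simply not have been covered on the platforms feeding the database.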
So again, this is at-large exome and whole genome sequence data, and the only information you get here is the frequency of that variant in this particular database. [Question: these are people without the disease?] That's right, or in many cases just general population samples. And that's a good question: if a different platform was used, you might not have a call in that database simply because that particular region wasn't covered, so you can't really use it to be sure a variant is absent. If it has been observed in the database, at least you know it's been seen, but the data set is not fully homogeneous: different platforms, and in some ways different variant calling. So you can use it to say a variant has been observed before, but not to say it has never been observed. Typically we use these databases to check that a variant has not been observed before: if you're looking for something novel, that's a useful resource. [Comment: in cancer genomics we also use it to check whether something is a benign polymorphism, especially if the frequency is above one percent.] That's right; if it's found in 10% of the population, it's unlikely to be very damaging. Okay, so these are really databases for providing frequency information about variants. Then you have more targeted databases that collect information about variants that have been associated with disease, with supporting information. These would be used in the reverse way: you want to find something that's not too common in the population, and if it has previously been associated with a disease in a different context, that probably pushes the variant up your list. This exists at the level of specific variants: that would be ClinVar, and in cancer you have other databases like COSMIC; at some level there's also now a database through ICGC that reports the frequency of different variants as observed in different types of cancer. Overall, this type of annotation, the frequency in the population and whether a variant has previously been observed and associated with disease, these are all fields you want to fill in and add to your variant list. The last category I had after the databases, if you remember, was effect prediction. Another type of annotation you want to add to your variants is their predicted effect: separate from presence or absence in various databases, can you predict what that variant is doing to the gene? Mike talked a bit about PolyPhen-2; there are a total of 87 different tools reported on this particular website that do this. So there are lots and lots of different approaches, and people have different opinions about what works better or worse. I don't think there's a single tool for this annotation that's necessarily better than the others; it's good, if you can, to have annotations from a few of these in parallel to predict the impact of the variant.
Now, a bit of a challenge: I've already mentioned that you want to annotate your variants against databases and predict their impact. Do you have to run all of these different tools separately? Thankfully not, because there are a number of tools, including the one you'll use in the practical, that aggregate different types of annotation. You only have to run one tool; it queries the various appropriate databases and predicts impact using a collection, or ensemble, of effect prediction algorithms. Again, quite a number of tools do this, each in a somewhat different way, and in some cases they claim to prioritize the variants even better. I wouldn't say there's just one right way of doing this; at some level, the convenience of running one tool that aggregates annotations and effect predictions from a number of sources is what matters. So we'll try one of them, which, as you'll see, aggregates a few different types of annotation. Finally, from the same resource, they also point to tools not just for annotating but for prioritizing; in the context of cancer in particular, there are a number of interesting tools that prioritize genes based on, for example, the number of mutations a gene accumulates relative to what you would expect by chance, or an enrichment of mutations in specific pathways. So again, there are a number of tools that can do this for different purposes. The tool and the approach we'll use in the practical is SnpEff, developed by Pablo Cingolani, who used to work at the genome center; it's really well integrated with GATK and is used quite heavily. Using SnpEff, you can annotate against a reference genome, calculate the effect of a variant (if it's a coding mutation, what type of effect it has), and get a basic prioritization into potentially high, moderate or low impact on the gene where it's detected. Again, we'll go over that in more detail in the practical. So, to continue with this overview and workflow: we started with the very big alignment files; you'll generate the raw VCF file, which basically just has the good quality calls with quality scores; and what you want to do after that is annotate this file to get a filtered and annotated list of variants, still in VCF format, but now including a lot more information in the various fields.
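To show what SnpEff's impact classes look like downstream in that annotated VCF, here is a small sketch that pulls effect, impact and gene out of the ANN INFO field. The subfield order assumed here (allele, effect, impact, gene name first, with one comma-separated entry per transcript) follows my reading of the ANN specification; check the SnpEff documentation for the authoritative format:

```python
# Pull effect/impact/gene out of a SnpEff-annotated VCF line and keep
# variants with HIGH or MODERATE predicted impact.
KEEP_IMPACTS = {"HIGH", "MODERATE"}

def snpeff_annotations(info: str):
    """Yield (effect, impact, gene) tuples from an ANN=... INFO entry."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            for ann in entry[4:].split(","):     # one entry per transcript
                parts = ann.split("|")
                # assumed subfields: allele | effect | impact | gene | ...
                yield parts[1], parts[2], parts[3]

with open("snpeff.annotated.vcf") as vcf:        # hypothetical file name
    for line in vcf:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        for effect, impact, gene in snpeff_annotations(fields[7]):
            if impact in KEEP_IMPACTS:
                print(fields[0], fields[1], gene, effect, impact)
```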
Two little things before I end. One tool that we won't use in the practical, but that we use to navigate these VCFs: once you've generated an annotated VCF for one patient, you might actually have multiple VCFs with these annotations, and one strategy is to load them into a tool like GEMINI, which lets you navigate the VCFs and ask specific questions, for instance to pull out variants that match your expectations based on the information you have about the pedigree, and so on. This is not something we'll do, but it's another tool I recommend once you've generated your VCFs, especially if you have multiple patients, or pedigree or family information, and you want to filter your VCFs using additional criteria; it's worth exploring. A final note before we move on to the practical: one thing we won't have time to cover, but that's extremely important, is visualization of these data sets. I didn't say what this tool was: it's the Integrative Genomics Viewer, the IGV browser, which lets you load both your BAMs and your VCFs. At some level you're generating all of these files without a way of checking what's actually coming out, and you might have a mistake in your pipeline somewhere, so it's very useful, and really necessary, to load your raw data; the raw BAM in this case, not the raw FASTQ, but the BAM where the reads are aligned. That's what you see here in gray: all the reads that have been mapped to the genome in a particular region. The matches are shown in gray, and there are different views, but in this particular view, when there's a mismatch, you actually see the letter at that specific position. From this it's quite clear that you've got some errors in your reads, at about the rate that's expected, while the variant we're looking at really stands out as being very clear. Once you get to the stage of having your VCF and having the variant you think is most interesting, it's very good, and necessary, to go back and look at the evidence in the reads for that variant; by looking at it, you might see that there was a mistake in one of the processing steps, or that the region is messier than you expected. So that's the end of this module; happy to take some questions, and if not, after that we'll go back and actually finish all these steps from the beginning. Going back to the point that there are regions of the genome that are not covered: if you have a gene that you suspect, and you were looking for mutations in that gene and nothing comes up in the VCF, it's also good to go and visualize the data; there might be no coverage on parts of that gene, or part of that exon, which would explain why you're not seeing anything. Even though we're not covering that in this workshop, because we didn't have enough time, there are other workshops you could attend that cover visualization, and it's really quite important to do. [Question: so why are these bases not uniformly covered?] It's a combination of things. There are regions that are not mappable: with pseudogenes or duplicated regions, you have ambiguity about where the reads come from. There's also a depletion of GC-rich regions in the sequencing, so for purely technical reasons those regions don't get sequenced nearly as much; in whole exome capture data, and even in whole genome data for that matter, the first exon, which tends to be GC rich, is frequently less covered. And if you don't have enough coverage, you end up with no calls. So that's another factor.
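Alongside eyeballing the region in IGV, you can also pull the read-level evidence programmatically; the same check is useful for the de novo scenario discussed next, where you want to know whether a parent has low-level support for the child's variant. A sketch assuming pysam is installed and the BAMs (hypothetical names) are coordinate-sorted and indexed:

```python
# Count read support for each base at a candidate variant position,
# e.g. to check whether an apparent de novo variant in a child has
# low-level read support in a parent. Requires pysam and indexed BAMs.
from collections import Counter
import pysam

def allele_counts(bam_path: str, chrom: str, pos: int) -> Counter:
    """Base counts at a 1-based position from aligned reads."""
    counts = Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for col in bam.pileup(chrom, pos - 1, pos, truncate=True):
            for read in col.pileups:
                if read.query_position is not None:  # skip dels/refskips
                    base = read.alignment.query_sequence[read.query_position]
                    counts[base] += 1
    return counts

# Hypothetical trio BAMs and a candidate de novo site:
for who in ("child.bam", "mother.bam", "father.bam"):
    print(who, dict(allele_counts(who, "chr1", 1_234_567)))
```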
So suppose you have a trio, and you see a mutation in the kid that you think is a de novo mutation because it's not called in either parent. It's probably good to go check whether there are any reads that still support it in the parents, because it might be below the cutoff for being called as a high quality variant in a parent, but still be there. If you didn't do this three-way calling, you might have missed it in a parent even though it's present. So it's good to go and actually check the coverage for what you think is a de novo mutation. [Question: will long-read sequencing overcome these issues?] Yes, long-read sequencing will help, maybe by giving more uniform coverage, and definitely by getting into these harder types of repeats: if you have, say, mobile element insertions of L1 that are too long for short reads, you can't detect them, but with long reads you will. The problem with long reads is that they're still quite expensive for deep enough human coverage, but at some point that will be useful. [Comment: with the longer reads you also get a higher error rate.] That's right, so it's a bit of a trade-off, as you're introducing a different type of error. But the error rate, I think, is coming along, and as soon as you have multiple of these long reads, you can correct the errors quite easily. I think the cost is the bigger challenge: right now, a whole genome with Illumina is in the range of $1,500 or something like that, while a whole genome with PacBio is $20,000 or something. The other approach is things like 10x, where you get information about long fragments but still do the short reads; but these are, I think, still at the research and exploration stage. [Question: is anyone trying something in between, not as expensive as long reads, but longer than short reads? 150 bp is too short, but I don't need a 10 kb read.] Yeah, that's a good question. To be fair, with 150 bp, especially paired-end, there's not much of the genome that's not mappable. But even if you're missing 5% of the genome, there's still a chance that that's exactly where the mutation you're interested in lies. So how do you weigh that? I go back and forth, but again, going back to this: how do you determine what's enough? If you're missing just 500 nucleotides in these genes, is that a problem or not? [Comment: one thing this doesn't show is whether the bases missed on subsequent assays are the same bases.] I think this is the average over multiple samples, so what you're missing is pretty systematic: it's because of repeats, because of lack of capture, so it's consistent. Well, it's consistent within a technology; it's not going to be consistent across technologies. That's one of the challenges if you're using these control cohorts with a new assay that looks into regions essentially not covered by the other one; going back to your question in the back, that's going to be an issue. But it's definitely consistent within an assay. And you have dropouts because of the assay and dropouts because of the analysis pipeline you're using, so you need to match both of those to really be able to compare. [Question: how long does it take, once you press the button, until the results come out at the other end?]
Mathieu would be better placed than me to answer that, but... [Mathieu: in general, for an exome it's a question of hours; for a whole genome, it depends on how complicated the pipeline is, and it's weeks, a few weeks, to run the pipeline. We were part of one project where we analyzed a tumor in three days, but that needed some trade-offs and a dedicated focus on it. Usually it's weeks for a whole genome, and then you have to add the downstream analysis on top of that.] So it is a limitation of whole genome sequencing. As Mathieu mentioned, we're part of a project where it's used in the clinical context, and waiting two weeks to get that answer is an issue. So we actually have to use a trick and analyze only a subset of the whole genome data to get an answer faster: instead of mapping all the reads to the whole genome, we map them only to the exome, to the coding regions, or even to a subset of actionable regions, so that we get an answer faster, and then we wait for the full answer two weeks later. The time it takes can definitely be an issue with whole genome sequencing. And it's funny, because you're generating all that data and then, in the end, you're mapping it to the genes anyway; so for some applications it makes sense, instead of mapping to the whole genome, which takes time, to map just to the genes.