your certificates of participation. The survey is important because we take your feedback and make adjustments to the course. A few of you have been at the workshop in previous years, so you already know me. Sorry for that slight delay.

So my name is Guillaume Bourque. I work at the McGill Genome Centre in Montreal, where I'm responsible for the bioinformatics platform and all of the analysis. I'll be covering the two modules this morning that deal with small variant calling and annotation, and then structural variant calling. You'll see that there are quite a few links with what you did yesterday. I trust that you all had a good time yesterday and learned a lot. I'm going to make a lot of back-and-forth references to things you saw yesterday, and you'll see the actual impact that some of the steps you've already done have on variant calling in my module.

So let's get started. The objective of this first module this morning is really to get an overview of the variant calling analysis workflow; that's the key message for this particular module. We want to understand the basic principles of variant calling, know what affects variant calls and how we can improve them, and, once you actually have the variant calls, learn how you can filter and annotate those variants. In the practical we'll go through all of that, so we'll split the time between going over the general principles and then actually running the steps in the lab. One of the links to what you saw yesterday is that we're going to be using IGV in parallel to look at the results.

OK, so this is a simplified view of variant calling and the various steps associated with it. At the top you're just looking at the raw reads and, in some cases, trimming; we'll go over that a little bit. You've done many of these steps as part of module two, namely the alignment: you look at the data, you clean the data in some cases, and you align all the reads onto the genome. That's the main alignment step. Then you can do the actual variant calling, and in parallel all sorts of statistics can be generated. Another way of looking at this is as a workflow of analysis: you look at the reads; mapping the reads is what you did as part of module two; then you get to the variant calling itself, which is what we're going to do this morning; and then the structural, larger variants, which we'll cover in module six.

Before I jump into module five, I want to cover some additional aspects of quality control and pre-processing because, not just in some cases but in all cases, these will have an impact on the variant calling itself. Some of these things were discussed already yesterday, but given how important they are, I wanted to go over these messages again. When you're doing bioinformatic analysis, just like when you're working in the lab, it's quite critical that you follow each step and do some quality control at every step of your analysis.
And that first step, looking at the raw data before you run through a whole pipeline and a whole series of commands, is really important. Again, you've covered this a little bit, but I want to re-emphasize that it's quite critical to know where the samples you're going to analyze are coming from, especially if you're going to compare different samples: were they all sequenced at the same time, using similar protocols and the same instruments? Are there any technical issues affecting some of the samples? You should have a very good answer to all of these questions before you jump in and start doing the variant analysis. In most cases you're going to be comparing multiple samples, not just one, so this is quite relevant.

This tool was mentioned already yesterday, but it's a pretty critical tool that's used quite a bit: FastQC. There are alternative tools, but this one already packages many components that allow quality control. In the practical, before we actually go through the variant calling steps, we'll run the samples through this tool and you can look at the specific results.

The samples we're going to analyze in the lab actually look quite good. This is one example for the first read: you see the typical profile of per-base quality scores, and although the quality goes down towards the end of the read, the overall profile is quite good. It's too small to actually see the values, but the quality of the reads is good. There are all sorts of different metrics you can look at. This particular view shows the quality of the reads depending on their position on the flow cell, and you see all of the green and yellow, which is a good sign.

The second read of one of the data sets we're going to study in the practical is also pretty good, but there is one thing that's flagged: while it's still overall quite good, the per-tile quality values show that some regions of the tile had lower quality scores overall. Even though in this case it's still quite good, if you're going to compare multiple samples it's a good idea to double-check some of them, or to go back after you've called variants if you see that you have lots of them.

Yes, I'll get to that. Depending on the patterns you see in these results, you can definitely trim or remove reads. One challenge, to be honest, is that these are very good reads; you'll frequently get profiles with some red in them, and then the question is: is this a problem or not? For the most part the variant calling itself will be robust to some of these slightly lower quality reads, but it's really a question of seeing, for instance, whether one of the samples has a horrible profile compared to the others. You want to know that up front before you start doing all the variant calling.

I've put the emphasis here on the quality of the reads, but even if you get sequences from the genome centre where I work, don't trust us: look at the reads we send you before you start your analysis.
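As a concrete reference for the practical, here is a minimal sketch of running FastQC from the command line; the file names and output directory are placeholders, not the actual lab files.

```bash
# Run FastQC on a pair of FASTQ files; one HTML report per input
# lands in qc/ (per-base quality, per-tile quality, adapter content,
# overrepresented sequences, and so on).
mkdir -p qc
fastqc -o qc sample_R1.fastq.gz sample_R2.fastq.gz
```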
So the quality scores are one thing; another is the presence of adapter sequences in your reads. This is, I guess, an old figure of a Solexa read, but basically in the sequencing process you'll have adapters that are ligated to the sequence you're interested in, and in some cases the read actually runs into the adapter. So looking for adapter sequences in your reads is another thing to watch out for, and it would be flagged by a tool like FastQC. An overrepresentation of sequences can also be an indication that there was a problem with the sample you're looking at.

To answer your question, and please don't hesitate if you have other questions: if you do identify issues with some of your reads, there are lots of tools, Trimmomatic being one of them, that allow you to remove some reads or trim parts of reads. One thing that used to be quite common was to trim from the end of the read, which is typically lower quality, so you would trim your reads from the 3' end based on different quality metrics. For most purposes, reads now are of very high quality and it's not as necessary to trim them, because the downstream variant caller will be robust to some lower quality reads and will still achieve high accuracy. But it's still something to keep in mind, given that in some cases you might have samples that are older or of lower quality.
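If trimming does turn out to be necessary, here is a minimal sketch of a Trimmomatic run on paired-end data; the adapter file and the numeric thresholds are illustrative assumptions, not recommended settings.

```bash
# Clip adapter read-through and trim low-quality 3' ends, keeping
# reads that stay at least 36 bp long. adapters.fa is a placeholder
# for the adapter sequences matching your library prep.
java -jar trimmomatic.jar PE \
  sample_R1.fastq.gz sample_R2.fastq.gz \
  sample_R1.paired.fq.gz sample_R1.unpaired.fq.gz \
  sample_R2.paired.fq.gz sample_R2.unpaired.fq.gz \
  ILLUMINACLIP:adapters.fa:2:30:10 TRAILING:20 MINLEN:36
```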
OK, so this was really just an introduction and warm-up before we jump into the actual variant calling. You covered the mapping quite extensively yesterday, but I'll go back to some components that were presented in the context of the mapping module to explain and show the impact they have on variant calling.

I keep saying "variant calling, variant calling", so maybe I should have had this slide first: this is really the goal of this module. It's quite small, but I hope you can see it on the handout. This is a view from IGV. At the top you have, potentially, a tumor sample; you see all of the reads, and most of them have a difference at a particular position. Contrast this with the normal sample from the same individual, where all the reads match the reference genome. This is the information we want to extract from all of the reads. We've mapped them, and now we want to distinguish cases like this, where there's clearly a difference (you don't even need a fancy algorithm to see that there's a difference at this position in this sample relative to the reference), from sequencing errors, which are much more sparse and randomly spread throughout. That's what we're after: how do we extract this systematically from our data?

This is another, more cartoonish representation of the same thing. All of these are reads, and we want to detect SNPs, variants, or somatic mutations, where multiple reads support a difference relative to the reference, and distinguish them from sequencing errors, which, as I said, we expect to be randomly distributed.

So we'll be using a few things. One piece of information we'll use to call variants is the number of reads supporting a change. We're also going to use the base quality itself. That's why I was saying that for most variant callers the trimming step is less critical: the caller takes into account the quality of each base within the read, whether it supports a variant or not. We can make use of the fact that for every read and every position there is a quality score associated with how clear the signal was, and the variant calling algorithm uses that information. In other words, reads that support a variant with a high quality score will be weighted more strongly than reads where the base itself was already ambiguous.

The next slide is slightly different from what you have in your handout, and it's pretty much the only equations I'll show, but I wanted to at least scare you a little bit. This won't be part of the practical per se, but there are a number of variant callers (we're going to use the GATK variant caller), and most of them work on some of the same principles, so I wanted to give you a sense of what those principles are. Most variant callers use Bayesian modeling: they view this as a Bayesian problem where you've got all of this data and you ask, what is the probability of a particular genotype given that data? We don't need to go into it too much, but what we're interested in is the probability of the genotype being a variant given all of the read data we have. Bayes' rule reverses the problem and asks, for a given genotype, what is the probability of the data? And then you integrate over all the possibilities. One thing that's also shown here is that you have to take into account that the human genome is diploid, so you've got heterozygous sites: you need to call sites knowing that there are actually two haplotypes in the genome. Another component that goes into this is the quality scores I mentioned, so the fact that a particular base has low quality is taken into account by the caller. The purpose of those two slides was just to scare you a little with the algorithm itself. The general principle is really just this: it's a way of integrating all the read information and producing a probability that there's a variant at a given position.
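To make the Bayesian picture slightly more concrete, here is the standard diploid genotype model in roughly the form most callers, including GATK's, use; the notation below is my own sketch, not the exact equations from the slides.

```latex
% Posterior over genotypes via Bayes' rule, with D the pileup of bases
% b_j observed at a site and G a candidate genotype:
\begin{align*}
P(G \mid D) &= \frac{P(G)\,P(D \mid G)}{\sum_{G'} P(G')\,P(D \mid G')}\\
% Reads are treated as independent observations:
P(D \mid G) &= \prod_{j} P(b_j \mid G)\\
% Diploid genome: each read comes from one of two haplotypes H_1, H_2:
P(b_j \mid \{H_1, H_2\}) &= \tfrac{1}{2}\,P(b_j \mid H_1)
                          + \tfrac{1}{2}\,P(b_j \mid H_2)\\
% Per-base error from the Phred quality Q_j: \epsilon_j = 10^{-Q_j/10}
P(b_j \mid H) &=
  \begin{cases}
    1 - \epsilon_j & \text{if } b_j \text{ matches } H\\
    \epsilon_j / 3 & \text{otherwise}
  \end{cases}
\end{align*}
```

This is also where the base-quality weighting enters: a high-quality base has a small error probability, so its contribution to the likelihood is much sharper than that of an ambiguous base.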
What I think is probably more relevant to you is looking at what actually affects the quality of variant calling. I'm going to go over four things, some of which you were actually doing yesterday in the mapping step, to explain how they improve and affect variant calling; in the lab, that's also what we're going to cover. I'll go through them one by one.

First, local realignment. After you've mapped the reads, one step that's usually done is to realign the reads locally. Initially, all the reads are mapped independently to the genome, so you get a first mapping of those reads, but it's been shown that when you actually have indels, the mapping of individual reads ends up being imperfect and often makes mistakes close to these insertions and deletions. So one step that's quite important is this indel, or local, realignment, which is typically done around indels or around places where there are lots of variants. After you've aligned each read individually to the genome, you look at all the reads at once and figure out the best local mapping of those reads. The problem is that before the realignment you end up making systematic mistakes in the read alignment that are going to lead to false variant calls. These look just like real variants: you've got multiple reads that are all saying the same thing, but they're actually all making the same mistake. So local realignment is one step that affects variant calling quite a bit, and in the practical I'll show you the effect of that realignment step.

Another thing that's important for variant calling is duplicate marking. If you don't do this step, you might get multiple reads that are really the same read that's been PCR-amplified, for instance: multiple reads that are identical. Because you're going to feed "how many times do I see this variant" into the algorithm, you're going to conclude that this is a variant even though it's just the same read observed multiple times. So it's important in the context of variant calling, in particular if you have amplified DNA that's been sequenced, to remove reads that are identical in their start and end, as you see here, because you actually only have one observation, and you don't want to send multiple copies of that observation to the variant caller; you're likely to get a false call.

That's a good question. For RNA-seq it's trickier: typically you don't want to remove duplicates, because for a highly expressed gene you might actually get multiple genuine observations that are identical in terms of their start. But if, depending on the type of reads you have, you're calling variants, then I would still recommend removing the duplicates, because you would expect, or hope, to see multiple independent reads covering the variant. Variant calling in RNA-seq is actually quite tricky, in part because of the alignment step: in RNA-seq data, because of the exon boundaries, you see this effect happening quite a bit as well, where the alignment of RNA-seq reads on the genome around the edges of exons looks very similar to this and sometimes also leads to false calls. But to answer your question: to do variant calling in RNA-seq data I would still remove the duplicates, because you want multiple independent reads supporting the variant.

And if you quantify expression, or differences in expression? Same thing, I think, because for the most part it's only the highly expressed genes that will give you truly identical reads. By removing the duplicates you're perhaps lowering the apparent expression of the most highly expressed genes, because now you've got identical starts repeatedly, but I think that's less of an issue than getting false calls and false positives.

OK, so we've talked about local realignment and duplicate marking.
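As a sketch of what these two clean-up steps look like on the command line, in the GATK 3-era style this workshop uses plus Picard; the file names are placeholders.

```bash
# 1) Local realignment around indels (two-step in GATK 3.x): first
#    find intervals worth realigning, then realign the reads in those
#    intervals jointly rather than one read at a time.
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
  -R ref.fa -I sample.bam -o realign.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner \
  -R ref.fa -I sample.bam \
  -targetIntervals realign.intervals -o sample.realigned.bam

# 2) Mark duplicates (reads with identical start/end) so the caller
#    counts each physical fragment as a single observation.
java -jar picard.jar MarkDuplicates \
  I=sample.realigned.bam O=sample.dedup.bam M=dup_metrics.txt
```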
Another one is base quality recalibration. Another thing that's been shown to be important is not to trust Illumina, or whatever sequencing company, to give you perfect quality scores. I'm half joking, but the instrument makes an estimate of the quality of each base, and it's been shown that it makes systematic mistakes. They've tested this, and the reported quality is really higher than the actual quality: it's overly optimistic. That's not the end of the world, but the problem is that it's overly optimistic in systematic ways as well. If you look here, for instance, there are specific types of calls, specific dinucleotides, where the quality scores are systematically off. So this step looks at all the reads mapped on the genome, looks at known SNPs, and adjusts and corrects the quality scores to remove these artifacts. It's a subtle difference, but adjusting the quality scores based on the data itself also improves the variant calling.
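A minimal sketch of that recalibration step, again in the GATK 3 style; we don't run it on the workshop sample, and the dbSNP file and paths are placeholders.

```bash
# Learn systematic quality-score biases from mismatches at sites that
# are not known SNPs, then write a BAM with corrected base qualities.
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
  -R ref.fa -I sample.dedup.bam \
  -knownSites dbsnp.vcf -o recal.table
java -jar GenomeAnalysisTK.jar -T PrintReads \
  -R ref.fa -I sample.dedup.bam -BQSR recal.table \
  -o sample.recal.bam
```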
The last step that can improve variant calling is, I guess, slightly more advanced and maybe not used as much, because for this you need to know a bit more about the population structure. I'm putting it here so that at least you know this exists. Everything I've said so far was really about doing variant calling one site at a time: you look at one base and at how many reads say this base is different. But here's the first mini-practical of the day; let's see if you're awake and if the coffee has kicked in. As most of you probably know, we all share blocks of DNA: we inherit blocks of DNA such that everything is inherited as a haplotype. You've got whole blocks of DNA that are passed on from generation to generation, so in a given population only a certain number of haplotypes are available. If you have the knowledge that there are only, say, these two haplotypes in that particular population, you can use the information from flanking positions to improve your calling at a given position. If you see that all of your reads at the other positions are, for instance, this A and this G, can you figure out what the ambiguous position must be? Right. So you have to make a hypothesis about the population structure and things like that, but you can feed that information in to improve the quality of your calls. I say this is a bit more advanced because you have to make more hypotheses about your data. It has a big impact especially if you have low coverage: if all you have is 3x or so of your genome, it becomes very difficult to call positions, because you don't have a lot of reads telling you whether it's an A or a G, so any trick that brings in as much information as possible makes a difference. In the context of the GATK framework, using this type of population structure information was shown to have a big impact on accuracy, but especially when you have low coverage in your sample; otherwise it's not nearly as critical.

I've been talking quite a bit about GATK because this is really the framework we'll follow in the lab. The main components of that software suite, starting from the raw reads, are the mapping, the local realignment, and the duplicate marking; all of this is what you covered yesterday in the context of module 2, I think. The base quality recalibration that I mentioned we didn't do on this sample. After that you've got your aligned reads, and then you move on to the variant calling itself, which is what we're going to do in the practical here. If you have population information, this is something you could also add, but we won't, because it's not so much needed for high coverage samples.

We're going to work with a toy example, just so that we can actually run things in the lab itself, but I wanted to give you a sense of what doing this on a real whole genome data set would be like. In the practical we're only going to work on a very small region of chromosome 1. In a real whole genome setting, after the alignment you would have the BAM files, which for a whole genome at 30x are typically on the order of 200 GB; at that point you've recalibrated, you've removed the duplicates, and you've done the realignment step. Once you have the mapped reads, doing the variant calling itself would probably take many hours, even on a cluster with significant resources. You'll see how long it takes with just a tiny region of the genome (it's going to be quick), but on the whole genome it would certainly take a significant amount of resources. After this step, this is really the algorithm, the tool, we're going to use, but there are other tools that do similar things, integrating the information from all the reads in slightly different ways to call the variants: SAMtools is another very common one, FreeBayes another. After this we get the raw variant file, the calls themselves, which is going to be much smaller in size.

Any questions so far? All good.
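The calling step itself, sketched in the same GATK 3 style; the interval and file names are made up, but restricting the run to a small region with -L is how a lab-sized example stays fast.

```bash
# Call variants from the cleaned-up BAM; -L restricts the caller to a
# small interval (placeholder coordinates) so the run finishes quickly.
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
  -R ref.fa -I sample.recal.bam \
  -L chr1:17700000-18100000 \
  -o sample.raw.vcf
```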
Again, just to re-emphasize: in the practical I'll show you how different the variant calls are going to be depending on whether you do these steps correctly or not. Why do the realignment? Why do the duplicate marking? You'll see how they change the calls a little bit.

So, just finishing this introduction before the lab, let's talk now about what we do once we get this list of variants. The BAM format is the file with the reads, their locations, and the mapping information; the variants we're going to get are in a format called VCF, and we'll get to see these files. In the VCF format there are lots of headers, but the key thing is that for every variant that's called you'll have information about the specific position, then the reference allele at that position, and then the variant allele that was observed. Importantly, you also get a quality score for the variant: the output of the variant caller is not just a variant, of course, but also a quality score. This is different from the quality scores of the individual bases within the reads; this is information about how confident the caller is that it observed a variant there. One slightly tricky thing is that the interpretation of these quality scores depends on the variant caller you used. On the scale of scores, of course, the higher the better, but for the actual interpretation you have to look into the variant caller to see what the score represents. So one tricky question is how you filter: what's a good quality score for a variant? That's partially defined by the variant caller you've used. Then you have additional information on each of these, but again, this is going to be a file where every line after the header is a particular variant with the information about the variant itself. It can get quite complicated in terms of the different INFO fields, and all the details are in the VCF file format specification that's linked here.

Typically, especially if you do whole genome or whole exome, you're going to get a very long list of variants, and the first thing you're going to wonder is how to filter it down to the good variants. There are really two main strategies for this; a minimal sketch of the first follows below. You can definitely filter your list of variants based on manual cutoffs. The quality score is an obvious one: maybe you'll say, I only trust variants above a particular quality score, or you require a certain depth of coverage. Anything that's variable will be reported, and then it's a question of filtering the calls themselves based on, for instance, the quality score and the depth of coverage. It's quite reasonable to require, say, a quality score above 30 and a depth of coverage of 10, or something like that. It sort of depends: if you're looking for a particular variant in a particular gene, maybe you want to keep everything that's potentially a variant, because you're going to go through it in detail; if you're doing this genome-wide and you want to start with top candidates, then maybe you want stricter filtering criteria, really saying, I need 10x coverage on that particular variant and I want high quality scores. It really depends on what you want to do.
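A minimal sketch of the manual-cutoff strategy using GATK's VariantFiltration; the QUAL and DP thresholds are just the illustrative values from above, not recommendations.

```bash
# Tag calls below quality 30 or depth 10; they stay in the VCF with a
# FILTER label instead of PASS, so nothing is silently discarded.
java -jar GenomeAnalysisTK.jar -T VariantFiltration \
  -R ref.fa -V sample.raw.vcf \
  --filterExpression "QUAL < 30.0 || DP < 10" \
  --filterName "lowQualOrDepth" \
  -o sample.filtered.vcf
```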
The second strategy: another nice feature of GATK is what's called the variant recalibrator, which is a way of re-ranking, or re-ordering, the calls. Each call was made individually, and it produced a score, but one strategy you can use is to feed the algorithm known variants, known sites that are variable, taking, for instance, high quality SNPs from the HapMap project. These are true variable sites in the genome. So you feed those in, in this case to the VariantRecalibrator tool of GATK: you feed in high quality SNPs that you can use to basically train the algorithm, to see how good or how bad your scores are; you feed in these variants to know what you're missing that should be there. You can also feed in dbSNP, which actually contains both very high quality SNPs and a lot of false positive SNPs. In the context of GATK, using this information to tune the scoring has been shown to actually improve the quality scores. So to use that component of the package, all you have to do is feed in a set of variants that you expect to be present in your sample and a set that potentially contains some false positives, and it's going to use those to re-rank your variants from high quality to lower quality. This is perhaps a bit tricky, so to go back to the earlier point: for the most part, manual filtering based on criteria that are reasonable to you is fine, but another option is this more advanced ranking, where you feed in some data sets and GATK learns what the good cutoffs are automatically.

That's right: dbSNP has both very good quality SNPs and, within the same set, many false positives. What they've shown is that SNPs that are in HapMap and in dbSNP are reliable, but things that are only in dbSNP and not in HapMap are enriched for false positives: variants that have been observed and reported in the past but never validated. So it uses that information, comparing how many of that particular set it's calling versus how many of the HapMap SNPs it's calling, and it improves the scores and the ranking.
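A sketch of that recalibration step, GATK 3 style; the resource files, annotations, and prior values follow the documented pattern but are illustrative, not tuned.

```bash
# Train a model with HapMap sites as truth and dbSNP as known sites,
# then re-rank and filter the raw calls by the learned score tranches.
java -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
  -R ref.fa -input sample.raw.vcf \
  -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
  -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
  -an QD -an FS -an MQ -mode SNP \
  -recalFile sample.recal -tranchesFile sample.tranches
java -jar GenomeAnalysisTK.jar -T ApplyRecalibration \
  -R ref.fa -input sample.raw.vcf \
  -recalFile sample.recal -tranchesFile sample.tranches \
  -mode SNP --ts_filter_level 99.0 -o sample.vqsr.vcf
```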
OK, another thing. This is actually an example from a project I was part of, where we did whole genome sequencing of 100 kidney tumors, and even after what I think was quite good variant calling and stringent filtering, we were still left with a lot of variants. I wanted to show you this just so you know what you're getting yourself into if you're doing whole genome sequencing and you're interested in variant calling. In this case, across the 100 tumors, these are the somatic variants: in total we had more than half a million of these calls, with a significant number of somatic calls. So even after you have your set of good quality variants, you need to start annotating them. On this display a few things have disappeared, but one key type of annotation to begin with, for instance, is the coding mutations: within all of these variants, you want to know which ones hit genes and which ones potentially modify the coding sequence. Of course, all the other ones are also potentially quite interesting, but the annotation of non-coding variants is quite tricky. After the variant calling and filtering, the next step is really annotating which variants potentially affect coding genes, or which potentially affect enhancer elements, and so on. For this there are again a number of tools; the one we're going to use in the practical is SnpEff, for figuring out the effect of SNPs, or of mutations.

Here you're actually feeding in the annotation of the reference genome that your sample is coming from, and it's going to predict the effect: whether the variant hits a known gene, whether it's synonymous or non-synonymous, so you get all of your variants with this kind of basic prioritization. There are lots of tools to do this, and it's not a perfect science: calls like synonymous versus non-synonymous are more robust, for sure, but you still have issues where there are multiple transcripts. So this is going to be informative, but there are different ways of doing it.

To continue with this workflow: starting from the original BAM files, we now have our raw variants; we can use GATK again to filter the variants, prioritize them, and identify the better ones, and then you can use tools like SnpEff to annotate the variants, and this will actually give you the list of annotated variants.

One thing that's actually quite subtle is the difference between tools and versions, and it depends on the quality of the sample; most of the differences appear, as I was saying, when you have lower coverage, and there it starts to make more of a difference. One thing that we use that I don't have here, but that I would really recommend, is the tools that allow you to compare different versions: you download a particular data set, you run GATK with whatever set of parameters, or another caller, and then you can actually compare the results. Again, the differences will show up especially if you're looking at low coverage. One of the nice things about GATK is that you can run it in the normal mode I mentioned, and some of these additional, more advanced features, the variant recalibration for instance, are also quite useful for making the variant calls more reliable.

Yes, so the question is what you would do if you had samples with different coverage. I think it depends a little bit on what you do downstream and how you interpret the variant lists you get. For the sample that has higher coverage, you do expect to detect more variants, but for most purposes, unless what you're interested in is the rate of variants or something like that, you typically want to extract as much information as possible. If you were doing a comparison of RNA-seq data, that would be a much bigger issue, because then you're comparing levels; with RNA-seq, if you're comparing different samples with very different coverage, it's going to be trickier, and down-sampling can potentially help make a fairer comparison. With variant calls, if you don't down-sample, all you have is basically more detections in the sample with more coverage, and I'm not sure that would be a big issue. Well, if that's the analysis you're doing, then yes, you're right. If you're doing a cancer discovery project, where the main thing is not the total number of variants but really the discovery of what kinds of mutations are in your sample, then it would be slightly different, and it probably wouldn't matter as much.

So the next step is going to be the lab. Any other questions on the overall workflow and pipeline?
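For reference as you head into the lab, a minimal sketch of the SnpEff annotation step; the database name and file names are placeholders for whatever build matches your reference genome.

```bash
# Annotate each variant with its predicted effect (gene hit,
# synonymous vs. non-synonymous, etc.). GRCh37.75 stands in for the
# SnpEff database matching your reference.
java -jar snpEff.jar GRCh37.75 sample.filtered.vcf > sample.annotated.vcf
```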