In this part of the module we're going to talk about how to annotate variants and prioritize them by relevance, and also touch a bit more on enrichment analysis. But let's focus on today's learning objectives. We want to understand which variant annotations you can use and how to interpret the impact prediction models that are out there, and then in the lab we're going to use one of these tools, called ANNOVAR, which is kind of a one-stop shop for variant annotation and impact prediction.
So first we'll go through a few introductory slides about variants and genes. There are two different levels to consider: the gene itself and the variant. At the gene level, the questions are whether the gene is central to a biological process that may be related to cancer or another disease, something like cell proliferation, apoptosis, or extracellular matrix degradation, and whether the gene is sensitive to a particular perturbation. If you think of a diploid cell and you take out one copy of the gene, does that affect the role of that gene in the cell? Does that contribute clinically to the disease? At the variant level, the question we're asking is what effect the variant has on the gene product itself.
The study of human genetic variation has both evolutionary significance and clinical applications, and we're going to focus a little more on the latter. There are different types of evidence to consider when we talk about variants. There's variant occurrence or frequency: the relative frequency of an allele in the population, expressed as its percentage across all chromosomes in that population. Then there's gene product function. What's the role of the gene itself? Is it contributing to a biological process or a pathway?
And then finally, what we're going to be talking about here is determining the effect of the variant on the transcript and on protein function.
Now, there are different types of variants out there. Our focus today is really going to be on small variants, 1 to 50 base pairs, but it's good to be aware that there are larger variants out there too. A typical small variant is a single nucleotide variant, usually a base-pair substitution, and with next-generation sequencing techniques it's relatively straightforward to detect. Small indels, insertions and deletions, are a bit more challenging to detect. There's a wide variety of databases, which I'll introduce later, that will let you map variants to exact coordinates within the genome. I won't go through the medium or large variants; the information is in your notes.
In terms of variant annotation, there are several things we're going to talk about today. We'll talk about the different variant databases that are out there. There are three categories, based on the type of information they capture. There's allele frequency from reference data sets, such as 1000 Genomes, the ESP database, and the ExAC database. There's also dbSNP, an NCBI resource that captures a lot of variant data across all species. And there's COSMIC, which is more specific to cancer and is a very widely used resource; I'll talk a bit more about that in a moment. We'll talk a little about gene mapping techniques, and also the gene product effect type. This is the impact: for example loss of function, gain of function, or missense mutations.
And then finally, we'll talk specifically about the different algorithms that are used to predict the outcome of missense mutations. There are other scoring approaches out there too, additional algorithms that can take other genomic features into account.
So let's talk a little about the variant frequency databases. The first one is the 1000 Genomes database, probably one of the older databases out there. Essentially, it's a global project to catalog variants in the population. They've been very leading-edge in developing software, tools, and data formats for exchanging variant information. They're currently in phase three of the project. Their goal is to identify all the variants with greater than 1% frequency in the human population. It has just over 2,500 subjects, who are apparently healthy, and because it's a global project the ethnicities capture different population groups around the world. The sequencing platform they're using is Illumina. In the early stages of the project, because of the high cost of sequencing in those days, they were only able to do about two to four times genome coverage. Now that sequencing costs have dropped, they're able to sequence to much greater coverage.
Another very useful resource comes from the NHLBI, that is, the National Heart, Lung, and Blood Institute. Sorry, I get too many acronyms with the NIH. Reactome is actually funded by the NHGRI. I still can't understand what that is. Basically, the goal here is to identify variants related to different heart, lung, and blood disorders, although in this case they're looking at a frequency of less than 1%.
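The 1% cutoff mentioned above can be sketched as a small calculation. This is a minimal illustration, assuming an allele frequency is simply the allele count divided by the number of chromosomes sampled; the function names and the example counts are hypothetical and are not taken from either project.

```python
def allele_frequency(allele_count: int, chromosomes: int) -> float:
    """Fraction of sampled chromosomes that carry the allele."""
    if chromosomes <= 0:
        raise ValueError("chromosomes must be positive")
    return allele_count / chromosomes


def frequency_class(af: float, cutoff: float = 0.01) -> str:
    """Label an allele as common or rare relative to a cutoff (1% here)."""
    return "common" if af > cutoff else "rare"


# Illustrative numbers only: 2,500 diploid subjects give 5,000 autosomal chromosomes.
af_a = allele_frequency(100, 5000)  # 0.02
af_b = allele_frequency(10, 5000)   # 0.002
print(frequency_class(af_a), frequency_class(af_b))
```

With this cutoff, the 1000 Genomes target falls in the common bin and the NHLBI project's target falls in the rare bin.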
The NHLBI project has a slightly different subject population: they're not necessarily looking at healthy people. These are people with different heart, lung, and blood disorders. They have different subclinical traits that they can track and monitor, and quite a lot of clinical data to go with that, which is very useful for data analysis. The ethnicities are not as diverse as in 1000 Genomes. They're looking at African-American and European-American populations, so it's mostly a U.S.-based project. Their platform is a little different. Rather than whole-genome sequencing, they're sequencing just the exomes, and their average coverage is very high at 110 times.
Next is the ExAC database, from the Exome Aggregation Consortium. Again, another large U.S.-based project, led by the Broad Institute in Cambridge, Massachusetts. Their job is to harmonize exome sequencing data, summarize a lot of this information, and make it available to researchers around the world. Their goal is to compile the largest set of exomes ever, and they're doing a pretty good job of getting there. Their subject population is a bit bigger, 60,000, and again not necessarily healthy individuals. They're focusing on cardiovascular disease, immune disorders, some neurological diseases, and cancer as well. I have been informed that they've removed certain individuals from different data sets; why that is, I don't actually know. Again, there's a distribution of ethnicities within the population; it's kind of a global population. The platform is the Illumina exome platform, and they're using GATK for variant calling.
All right. Now, another old database. In fact, I think it's younger than me, but it's been around since the mid-90s. This is dbSNP.
It's a database capturing variants across all species. It's a very large collection of variants, sometimes a little overwhelming, I would say. One very good advantage of this database is that it's integrated with the other NCBI resources, so it's very easy to find relevant supporting annotations and literature citations associated with these SNPs. The other thing to point out is that it has a relatively good look-up service for variant information. Although some claim the filtering criteria are good, I find them a little challenging. And you sometimes have to make sure you remove certain plant variants when you're looking for variants within the population.
Another very, very good resource is the Catalogue of Somatic Mutations in Cancer. This is COSMIC. It's based at the Sanger Institute in the United Kingdom. They're cataloging somatic mutations in cancer, and they've created a very large reference database with a very nice user interface. They capture a variety of variant data from systematic screens, from The Cancer Genome Atlas and also from the International Cancer Genome Consortium, based here at OICR. They also, which is nice, capture a lot of expert, manually curated variant data. That's very useful when you're screening through a lot of variant information, because there's a clear, direct relationship between the genotype and the phenotype, with all the relevant cited sources and publications. It's a very valuable resource, particularly when you've identified a variant of interest. The important thing to mention is that they're also capturing the frequency with which a gene is mutated. Some resources don't provide that information, or don't provide it for all variants. COSMIC does a very nice job of that.
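The gene-level frequency mentioned above, meaning how often a gene is mutated across samples, can be sketched as follows. This is only an illustration with made-up records; the function name and the sample and gene values are hypothetical, and it does not reflect how COSMIC itself computes or exposes its numbers.

```python
from collections import defaultdict


def gene_mutation_frequency(records, total_samples):
    """Fraction of samples with at least one mutation in each gene.

    records: iterable of (sample_id, gene) pairs, one per observed mutation.
    """
    samples_per_gene = defaultdict(set)
    for sample_id, gene in records:
        # A set counts each sample once, even if it carries several mutations.
        samples_per_gene[gene].add(sample_id)
    return {gene: len(ids) / total_samples for gene, ids in samples_per_gene.items()}


# Hypothetical screen of 4 samples.
records = [("S1", "BRAF"), ("S1", "BRAF"), ("S2", "BRAF"), ("S3", "TP53")]
freqs = gene_mutation_frequency(records, total_samples=4)
print(freqs)
```

Counting samples rather than mutations keeps one heavily mutated sample from inflating a gene's frequency.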
If I seem like I'm blasting through these slides, please raise your hand and ask questions. Okay, time to just go back a slide. Excuse me. Oh, I did actually. Sorry, I apologize.
Gene mapping. Now we want to talk a little about the methods used to identify different gene loci and things like the distance between genes. There are different things to consider when we're talking about genes. The focus, when we're looking at missense mutations, is the location within protein-coding genes, which makes it much easier to look at the impact of certain variants. Then there are also the non-protein-coding RNA genes, typically microRNAs. Some of the caveats relate to functional relevance. Large changes in expression may well be a consequence of particular variants. Different variants in different cells could have similar impacts, but you can also expect that different cells with the same variant will show different impacts, depending on which biological processes are affected. We'll address that a bit in the talk tomorrow, when we consider the impact of variations on biological pathways.
Now, when we're mapping to the gene, there are different parts of the gene to consider. There's the untranslated region, the UTR, which is transcribed but not translated. Sometimes it's difficult to predict the impact of a variant within this location. It's much easier to predict the impact of variations within the coding exons, and to some degree within introns as well, if the variant affects splicing: whether an intron is spliced out, or whether an exon is included in or excluded from the final transcript.
Other things to consider are areas further upstream of the transcribed gene, areas downstream, and the intergenic regions in between.
The system we're going to be talking about, and demonstrating in the lab later, is ANNOVAR. It has quite a nice priority system. As I said, ANNOVAR is one of these one-stop tools for annotating variants and examining the functional consequences on the gene itself. You can annotate a variety of other genomic features, like cytogenetic bands, and you can apply different algorithms and look at the functional impact scores of those variants. Looking at the table here, ANNOVAR follows this order of priority. The default is the exon first, and then a precedence is applied to the different gene features or gene parts: splicing, non-coding RNAs, UTRs (both 5' and 3'), intronic, upstream, downstream, and then the intergenic region itself.
What ANNOVAR tries to do is reduce the number of variants, starting from a larger data set and filtering it through a variety of gene parts and algorithms, down to maybe a hundred, or in some cases a handful, of significant variants. The mechanism is as follows. It removes things like frameshift mutations. It focuses on conserved regions within genes. It removes things like segmental duplications. It also removes variants found in 1000 Genomes and dbSNP. It may well remove dispensable genes. These are genes with a high frequency of loss of heterozygosity that are typically found in healthy people, because a variant there is not necessarily going to have clinical relevance. And this slide just shows that the priorities are based on different sections of the gene.
Now, to map variants to coding and non-coding genes, you need a reference nucleotide database. In the case of ANNOVAR, we're going to be using RefSeq.
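The precedence order described above can be sketched as a simple lookup: when a variant overlaps several gene parts, report the one that comes first in the list. This is a minimal sketch that mirrors the order as listed in this talk, not ANNOVAR's actual code; the function name and the region labels are hypothetical.

```python
# Precedence as listed in the talk, highest priority first.
PRECEDENCE = [
    "exonic", "splicing", "ncRNA", "UTR5", "UTR3",
    "intronic", "upstream", "downstream", "intergenic",
]


def top_region(overlapping):
    """Return the highest-priority region among those a variant overlaps."""
    ranked = [region for region in PRECEDENCE if region in overlapping]
    if not ranked:
        raise ValueError("no recognized region label")
    return ranked[0]


# A variant that is intronic in one transcript but at a splice site in another.
print(top_region({"intronic", "splicing"}))
```

The gene parts themselves come from whichever reference gene set is used, which in the lab is RefSeq.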
RefSeq is traditionally one of the suggested databases for gene and coding sequence definitions. There are other databases, and each will have its own genome browser: there's UCSC, and there's also Ensembl. The choice is entirely up to you, depending on how comfortable you are with those genome browsers. I should carefully watch what I say in case I incriminate myself, since this is being recorded. I personally like UCSC. It's one of the resources that has been around for a long time. Then again, at a previous database that I curated for, we focused a lot on RefSeq data, which is very useful because it's integrated into a lot of NCBI resources. And Ensembl is a nice European tool. I preferred the old interface, and then they suddenly changed it. The default features all changed, and I had to go and select all of the features I wanted to use. So I've said my piece.
Now let's focus a little on gene product effect. There's a lot of experimental work out there, and you can study specific variants on the basis of the contribution of a particular gene to a particular function, like an interaction or some kind of assay. Or, based on three-dimensional structure, we can use that information to understand the gene product effect, or how a variant can have an impact on a biological process or a pathway. There are also regulatory effects, and sometimes it's difficult to establish what the change means. Some cases are easier. If it's known where a mutation is and it's within a domain, then you can potentially say that if the mutation changes the domain structure so that it no longer binds its interaction partner, then that's the impact.
Typically that occurs most of the time with protein-coding sequences, and I would say that most of the time it is easier to chase after the protein effects. When you start looking at UTRs and intergenic regions, it's sometimes very difficult to understand the impact of those variants. In fact, there are some studies showing that once you move from small populations to global populations, a lot of those initial findings break down and you no longer find the same significant variants associated with a particular clinical phenotype.
When we're talking about the change a variant makes to the protein-coding sequence, there are a variety of types. There's the stop gain, which adds a stop codon and causes a truncation of the protein. In some cases that may have clinical impact. As an example, take NOTCH1: if you have a truncation within the C-terminus of the protein and you no longer have the PEST domain, the protein is no longer degraded, so it sticks around much longer. So a particular stop codon can have a dramatic effect, and that particular class of mutation has been linked to ALL and AML, which are two forms of leukemia.
There are also frameshift indels, where we see a shift in the reading frame, and from that point forward the protein is translated incorrectly. That has a particular impact if the C-terminal part of the protein has a domain involved in, say, kinase activity or an interaction with another protein, and you no longer have that domain. Then there are splicing variants, which affect whether or not a splicing event happens. There are also in-frame indels, which remove or add one or more amino acids.
Again, if that's in an unstructured part of a protein, there may not necessarily be an effect, but if it's in a domain, that may have a drastic effect on the function of the domain or on an interaction. There's the stop loss, which is the loss of a stop codon, so you get translational read-through and essentially end up with an extra piece of protein; again, the impact of that is questionable. There are also missense SNVs, probably one of the biggest groups, where you have modified just one amino acid. And finally you have synonymous mutations, where there's no amino acid change, but there could well be additional effects on the transcript that you may not be aware of if you only look at the protein. You may not see a loss of function of the protein, but you may see some other deleterious effect, for example because the transcript is more or less stable.
Then there are the loss-of-function variants. The definition here is stop gain, frameshift, and splicing. They're definitely more disruptive, but the question is what percentage of the protein is affected by these variants. Are there multiple transcript isoforms? We know that genes are transcribed in different ways, and we don't always know whether the affected transcript is being expressed in a particular cell type; a variant may have no impact if the transcript it sits in isn't expressed. Splicing events are somewhat difficult to predict, and we do get cryptic splice sites, which make it difficult to ascertain the effect of a loss-of-function variant. And sometimes a frameshift can be rescued by another frameshift.
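The effect types listed above can be sketched by translating a reference and an altered coding sequence and comparing the results. This is a toy illustration under strong assumptions: both inputs are complete coding sequences that start in frame, only the standard genetic code is used, and splicing is ignored. The function names and the example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds: str) -> str:
    """Translate complete codons only; a trailing partial codon is dropped."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


def classify(ref_cds: str, alt_cds: str) -> str:
    """Coarse effect label for an altered coding sequence."""
    if (len(alt_cds) - len(ref_cds)) % 3 != 0:
        return "frameshift"          # net length change not a multiple of 3
    if len(alt_cds) != len(ref_cds):
        return "inframe_indel"       # whole codons added or removed
    for ref_aa, alt_aa in zip(translate(ref_cds), translate(alt_cds)):
        if ref_aa != alt_aa:
            if alt_aa == "*":
                return "stop_gain"
            if ref_aa == "*":
                return "stop_loss"
            return "missense"
    return "synonymous"


REF = "ATGGTGAAATGA"  # Met-Val-Lys-stop
print(classify(REF, "ATGGAGAAATGA"))  # GTG changed to GAG
print(classify(REF, "ATGTGAAATGA"))   # one base deleted
```

Note that the frameshift test looks only at the net length change. A one-base deletion followed downstream by a one-base insertion gives a net change of zero, so this check alone would not flag it as a frameshift, which is the rescue situation just described.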
When I used to do yeast genetics years and years ago, my job was to do mutagenesis, so you'd spike the PCR reactions, and there were times when I actually got this: there'd be a frameshift, and further on there was a second mutation that rescued the first. You can run gels to look for those frameshifts, and you can sequence them to identify the mutations. But the functional impact is a question of whether it truly causes a loss of function, or whether one frameshift is rescued by the other.
So the focus here is on missense variants. When you're doing what this says, tell me more. I'm going to ask you what you meant by this. But really, how do we tell if a missense variant alters protein function? That's the $64,000 question, I think, for a lot of us. If a mutation causes a change in the amino acid and the side group, the R group, is very different, it could have a very deleterious effect on the protein. Other things to consider are the conservation of particular residues across species. The assumption is that amino acids are conserved across species when they have a functional role. So if you have a mutation there, it may have a larger impact on the role of that protein, particularly if it's in a conserved protein domain. There's also secondary protein structure to consider, and whether you affect it. But I think most of the time the big question is the impact on three-dimensional protein structure, particularly for docking and interactions with drugs.
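The side-group comparison described above can be sketched with a coarse grouping of amino acids. The four classes below are one common textbook simplification and are an assumption of this sketch; real predictors use richer features such as conservation and structure. The function name is hypothetical.

```python
# One simplified grouping of the 20 amino acids by side-chain character.
CLASSES = {
    "nonpolar": "AVLIMFWPG",
    "polar": "STCYNQ",
    "acidic": "DE",
    "basic": "KRH",
}
AA_CLASS = {aa: name for name, members in CLASSES.items() for aa in members}


def side_chain_change(ref_aa: str, alt_aa: str):
    """Return (reference class, altered class, whether the class changed)."""
    ref_class, alt_class = AA_CLASS[ref_aa], AA_CLASS[alt_aa]
    return ref_class, alt_class, ref_class != alt_class


# Valine to glutamic acid swaps a hydrophobic side chain for a charged one.
print(side_chain_change("V", "E"))
# Valine to alanine stays within the same coarse class.
print(side_chain_change("V", "A"))
```

A class change is only a rough flag; a substitution within the same class can still matter if the residue is conserved or sits in a functional domain.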
We've just written a grant on this for Reactome, and what we see is that there aren't many pipelines that try to predict variant impacts and then put them in the context of a pathway. Most of the time, and certainly with the tools we're looking at, you have a list of variants, you predict which ones have some functional impact, and then you look at the annotations associated with them in different databases, whether that's Gene Ontology terms or pathways. But that's just a term. It's not telling you the consequence or the mechanism. What pathway databases try to do is this: your way in is that a particular variant is part of a reaction. The question is, what's the downstream effect of changing that biological event? Do you get the same products at the end of the metabolic reactions, or do you upregulate a signaling event? That's something we're trying to do with Reactome, and if we get funded for the project, there will be more to talk about then.
Then there are other functional consequences of mutations, namely effects on post-translational modifications. In the case of signaling that could well be phosphorylation, but there are a variety of other post-translational modifications that have a huge impact on epigenetic effects, like methylation and acetylation of residues. Variants there could have a huge impact on different biological processes.
One other, more computational, issue: with all of the variant prediction methods, there has to be training. There has to be a computational approach, a machine learning approach.
And the question is what kind of training set you use, and whether you include positive and negative data within it. A positive data set is one where there's a causal link: these are variants with a clinical impact. The negative data sets are variants where that hasn't been demonstrated. The difficulty is that we haven't looked at all variants in all tissue types and all diseases, so we can't know for certain. So negative training sets are not always reliable.
As an example, the graph here shows a particular mutation in BRAF, valine to glutamic acid, which is a very distinctive change, and it has been linked to a pathogenic phenotype. Then we have this other mutation, valine to alanine, where the role is basically unknown. It has been tested; the question is what the impact of this mutation is. Another thing to think about is the tissue you're in. Hypothetically, that BRAF valine-to-glutamic-acid substitution could be occurring in one particular cell type where it's been studied, and a different tissue in the body could have that same variant but with a different biological event affected, so the outcome is different. That's just something to bear in mind when we're looking at variants.
Now we're going to focus a little on conservation and some of the different variant scoring methods. Conservation is a powerful, broadly used idea. The premise is that if a change occurs at a conserved nucleotide, it could well have a functionally significant impact on that protein.
And there are different approaches to scoring that information. Oops. These are at UCSC. There's the phyloP score, which is very useful for assessing single variants. There's phastCons, which is used to assess putative regulatory regions and non-coding regions. That's quite useful when trying to understand the functional impact of variants in non-coding regions. And then there are other multiple species alignment tools out there.
The UCSC Genome Browser is one of the browsers I like the most. It's a very nice way of looking at different gene features. If you're familiar with it, it has different tracks, so you can select and display different levels of information about the genes. You can look at particular coding exons and UTRs, and you can look specifically at nucleotides within the codons. At the bottom here is the conservation of the amino acids, and the different tracks represent the amino acid sequence across different species. At the top are the higher eukaryotic species, and as you go down you get into the lower eukaryotes.
phyloP is one of the tests used to detect nucleotide substitution rates. You need an alignment of the sequences. As for the score itself: typically, at a conserved position the score will be greater than two. If the site is neutral, we see zero, and negative scores mean the site is more divergent than expected under neutral drift. The scores can also be broken down by species group: all vertebrates, mammals, and primates.
Sorry, what's the neutral drift? That's a good question.
And I'm going to have to defer to my colleague Yuri to answer that one. Apologies. I should remember my evolution and genetics.
I teach. Well, genetic drift is just the idea that you have something outside of natural selection causing a change in the allele frequency in the population. The classic example is the volcano that kills off a bunch of people, so you've got sampling of only a subpopulation, and suddenly your allele frequency has changed. Presumably neutral drift would be that that's not happening.
Thank you. One of the main caveats is that if you use conservation for a given position, it will not tell you directly the effect of the variant, only whether the position is important. That's it. And that relies on a lot of other testing, which I'm going to talk about in a minute.
So, when looking at missense variant effects, there are different scoring models. Some criteria that distinguish them are which features are used, whether that's amino acid conservation or specifically physicochemical properties; the different types of machine learning technique; and, as I mentioned earlier, the choice of training set for that machine learning technique. The examples given here are comparing gain-of-function mutations versus inactivating loss-of-function mutations, and Mendelian disorders, where loss of function prevails, versus somatic mutations, which are significant in cancer.
The first method is SIFT. It's the first widely used one, designed for identifying deleterious mutations, that is, ones that disrupt protein function. The approach is to start with the initial protein sequence, do a PSI-BLAST to look for similar sequences, do a multiple alignment of the sequences, identifying orthologs and paralogs, and after that create a matrix based on amino acid probabilities at each residue. For every residue there's an amino acid probability; it's re-weighted by the amino acid diversity at that position; and finally you get a probability score where the observed amino acid is normalized against the residue conservation. Now, I have to be honest, that's a lot more technical than I can explain, so I'm going to have to again defer to Yuri to explain this one later. Actually, Yuri is probably going to listen to these slides.
PolyPhen is Polymorphism Phenotyping. It adopts a variety of different sequence features and considers a lot of other relevant protein parts, so things like domains, interactions, and three-dimensional structures, as well as the gene parts I was talking about earlier. This is where it gets interesting, because they use two different training sets. As far as I'm aware, it's the only one that has made good progress with both a positive and a negative training set. In HumDiv, which is the human diversity set, the positive set is alleles that have an association with a known Mendelian disorder, and that information is pulled in from UniProt. UniProt is one of those really interesting resources that has a lot of metadata, and a lot of cross-references between what you traditionally think of as protein sequence and other phenotypic annotations; it's very useful to look at sometimes. The negative training set is non-damaging differences between human proteins and their related homologs in other mammals. The HumVar training set again has two parts: the positive is all human disease-causing mutations from UniProt, and the negative is non-synonymous SNPs without disease associations. So it is a richer model than SIFT, but it does have a little more bias towards its training sets than SIFT. Both SIFT and PolyPhen are probably quite useful approaches for identifying the impact of variants.
Then there's MutationAssessor, a newer member on the block. It doesn't use machine learning. It does consider Mendelian traits, and it links out to OMIM. It uses an amino acid conservation modeling approach and incorporates protein subfamily information, so it can be regarded as an enhanced SIFT. Again, the entropy-based score is a little beyond my scope, so I'll leave that one for Yuri to explain. Overall it tends to perform pretty well on recurrent somatic variants.
CADD is a new one which, I have to be honest, until earlier today I did not know existed. All I can really say from what I read this morning is that it does seem to be a lot better than PolyPhen and SIFT at identifying deleterious coding and non-coding sequence variations. It uses a different machine learning technique, it employs both positive and negative training sets, and the predictive features are very wide-ranging. This graph is supposed to tell you exactly what I've just said, and we'll leave it there.
One other challenge with predicting the impact of single nucleotide variations is the effect on exon inclusion and exclusion, that is, how a nucleotide change affects splicing events. It's not as easy to identify the impact of these variations on splicing, and some of the algorithms out there don't necessarily work for variations in splicing regions.
And finally, as I said earlier, the effect on the protein sequence can also have an effect on post-translational modifications in signaling events, like phosphorylation. Yuri would probably be the best person to describe this because he actually did the work on it. But it does seem that there are nucleotide changes that affect phosphorylation and protein modification sites, and typically they're not picked up by a lot of the mutation assessment tools. If Yuri had been here, he would be telling you how to identify them, and we'll leave it there.
Just to summarize: nucleotide-level conservation is a simple and powerful approach. It's important to look at multiple alignments and other gene features. When looking at the impact of variants, the missense scoring models are powerful, much more powerful than the methods for variants within regulatory elements or in splicing regions. Other things to consider when you're reviewing variant predictions are conservation; the effect scores from the different models; the amino acid changes; the sequence context, whether you're looking at a protein domain; whether you're getting clusters of somatic variants, and whether they cluster in a particular protein domain; and, as we'll talk about a little later, not forgetting the gene-level information itself and the other annotations that could be out there. So apparently we're on a coffee