All right, let's get started. So, the learning objectives of the module. Having done a sequencing experiment, you have detected maybe dozens, hundreds, or even thousands of genome variants, and you want to interpret them in the context of what they may be doing in the cancer biology. The learning objectives are therefore to understand what variant annotations are out there and how we can use them, and how we can predict the impact of a particular genome variant using the various machine learning tools and mathematical models that are available. In the practical session you will learn to use an annotation tool called ANNOVAR, which translates genomic coordinates to protein coordinates so that you can further interpret what a variant may be doing in the context of protein function.

When we have performed next-generation sequencing of cancer genomes and extracted all these variants, some of which relate to genes and others to intergenic regions, we have to consider information at two levels of organization. One is the level of the gene: once we have identified a gene of interest that seems to be mutated frequently, we can ask whether the gene is related to a process involved in cancer, such as apoptosis or the cell cycle, and we can investigate whether the gene is known to be sensitive to perturbation, for example using CRISPR and shRNA screens. But before going to the level of a particular gene, we really want to know the impact of the variant itself. The variant may affect the coding region of a gene, or it may affect a gene that does not produce a protein at all, so a non-coding RNA. All of this we can do with various bioinformatics tools, and we will walk through some of them in this lecture.

When we have different kinds of evidence about variants in cancer genomes, there are several facets to look at. One is variant recurrence, which is a very powerful concept: if you analyze the somatic genomes of many unrelated cancer patients and a variant recurs in many of them independently, it may be a driver variant. The genome is a very large space, so mutations hitting the same region over and over again by chance alone is unlikely; recurrence suggests a driver mechanism. On the other hand, once we have identified a gene, we may start to build biological hypotheses about what that gene may be doing, whether it is activated or repressed by a mutation, and then we look at the pathway and network context, which is the topic of the second lecture. Right now we are more interested in the particular variant, a single nucleotide variant or an indel in a gene, and we want to predict what that variant is doing to affect the gene biologically, or whether it has no role whatsoever.

In the genome, and in cancer genomes in particular, we see many kinds of alterations, and one of the simplest classifications is by size. At the small end we have events of a single nucleotide up to maybe 50 nucleotides. The most common variant type is the single nucleotide variant: a one-base-pair substitution where one letter of the DNA alphabet is replaced by another.
These single nucleotide variants are probably the most high-confidence data we can extract from cancer genomes. Various algorithms detect single nucleotide variants, and their outputs largely agree with one another, so it is easy to form a consensus set of variants for a genome of interest. Small indels are insertions and deletions of a few nucleotides, sometimes a little more. These are more challenging to detect, and different callers agree less well, so a more stringent analysis would combine the output of several indel-calling algorithms into a consensus. In this lecture we mostly focus on small variants, in particular single nucleotide variants but also small indels.

In the medium range of hundreds to thousands of base pairs we have structural events such as insertions, deletions, inversions, translocations and so on, and sometimes complex rearrangements that combine several of these alteration types. These are the most challenging to detect, partly because of read length: such events are often longer than the sequencing reads, so no single read covers them entirely. When you map these medium-sized events to genes, a rearrangement may capture only a fragment of a gene, and from that fragment you can already start to hypothesize how the gene itself is affected. At the large end we talk about copy number alterations; one possible threshold is 5 kb or more, and sometimes copy number alterations affect entire chromosomes or chromosome arms, so we use cytobands to annotate them on the genome. Copy number variants are relatively easy to detect with microarray platforms and a little more challenging with next-generation sequencing.

When we start to annotate variants, there are several components of the pipeline that we need to understand. One of them is the data we use for mapping the variants. When we find a variant in a cancer genome, we first want to relate it back to existing knowledge, and for that we have databases such as the 1000 Genomes Project, the ESP6500 project, or the ExAC consortium dataset, which essentially tell us how frequently the variant has been found in earlier studies, in particular in healthy individuals. There are also databases that capture cancer genome sequencing; COSMIC is one of them, and it is a mixed bag of studies, from small-scale studies of earlier days to recent whole-genome and whole-exome sequencing studies. Step number two, once you have figured out how frequent the mutation is in existing datasets, is to map it to genes, because genes are definitely among the more interesting parts of the genome. But you don't map your variants only to genes as a whole: genes have different components, like untranslated regions, protein-coding regions, promoters and so on. Once you have mapped your variants to genes, or decided that they don't overlap any gene at all, you may want to predict the impact of any given variant considering the features of the DNA around it.
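Before going through the individual tools, here is a minimal sketch of the three-step flow just described: frequency lookup, gene mapping, and impact prediction. The data tables, coordinates and helper names are hypothetical placeholders chosen for illustration, not any particular tool's interface.

```python
# Minimal sketch of the three annotation steps; all data and helper names are
# hypothetical placeholders, not any specific tool's API.
from dataclasses import dataclass

@dataclass
class Variant:
    chrom: str
    pos: int        # 1-based genomic coordinate
    ref: str
    alt: str

# Step 1: look up how often the variant was seen before (toy frequency table;
# coordinates are illustrative).
POPULATION_AF = {("chr7", 140453136, "A", "T"): 0.0}

# Step 2: map to genes (toy interval table: gene -> (chrom, start, end)).
GENES = {"BRAF": ("chr7", 140419127, 140624564)}

def population_frequency(v: Variant) -> float:
    return POPULATION_AF.get((v.chrom, v.pos, v.ref, v.alt), 0.0)

def overlapping_genes(v: Variant) -> list[str]:
    return [g for g, (c, s, e) in GENES.items() if c == v.chrom and s <= v.pos <= e]

def impact_placeholder(v: Variant) -> str:
    # Step 3 would call an effect predictor (SIFT, PolyPhen-2, CADD, ...);
    # here it is only a stub so the shape of the pipeline is visible.
    return "unknown"

def annotate(v: Variant) -> dict:
    return {"variant": v,
            "population_af": population_frequency(v),    # step 1
            "genes": overlapping_genes(v),                # step 2
            "predicted_impact": impact_placeholder(v)}    # step 3

print(annotate(Variant("chr7", 140453136, "A", "T")))
```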
There are a number of tools and algorithms that do this, for example SIFT, PolyPhen and Mutation Assessor, which focus on protein-coding sequence. There are also other effect-scoring methods; CADD, for example, provides a score for every nucleotide position in the genome, so literally billions of scores, and there are many other ways to interpret coding and non-coding variation. This is a very active topic of research: you can look at predictions of splicing regulation, or at transcription-factor-bound regions from ENCODE and ask whether mutations alter gene-regulatory motifs, and so on and so forth.

All right, so the first step to discuss is variant annotation databases and allele frequencies. One of the first and most important large-scale projects is the 1000 Genomes Project, which is no longer literally a thousand genomes because the latest phase contains a few thousand individuals. The goal of this project is to capture human genome variants that occur at relatively high frequency, more than about 1% of the human population, across various ancestry groups, for example Europeans, Latin Americans, Africans, Asians and so forth. The project covers the entire genome, but the genome is large, so coverage of the whole genome is somewhat lower while exonic regions are covered at a higher rate. Another dedicated project focusing on exonic regions is the ESP project, in which 6,500 individuals have been profiled, and it is important to know that this dataset does not necessarily include only healthy individuals: the goal of the project is to discover cardiovascular, lung and blood disease variants that are perhaps low frequency, less than 1%. So if in your study you use that dataset as a filter, pay attention to the fact that this is not necessarily healthy variation; some of the variants you capture may be overrepresented relative to the general population because you are ultimately looking at disease cohorts. ExAC is a very recently published dataset whose goal is to assemble the largest exome sequencing dataset ever, about ten times more individuals than the previous datasets, 60,000 unrelated individuals, and again these are not necessarily healthy people; the cohort includes various disease cohorts, because they tried to capture every available exome sequencing dataset and process it uniformly. So when you use it as a control group, acknowledge that some of these variants may be disease variants as well. dbSNP, on the other hand, is not a single study processed uniformly but more of a meta-database of different studies accumulated over time. It includes submissions from before and after the next-generation sequencing era, so some entries are small and based on older sequencing technologies while others come from high-throughput next-generation sequencing of whole genomes. It includes polymorphisms found in the general population, so it is a database of healthy variation, but at the same time it includes disease-associated variants and also cancer variants. So it is a good resource to look up your variants and study them in detail, but the moment you start using it as a filter you may be filtering out something important. For example, because it contains somatic variants found in cancer, if you discard anything that appears in dbSNP you may actually remove some interesting results from your data; a minimal sketch of a more careful frequency-based filter follows below.
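The sketch below illustrates one cautious way to use these databases: filter on population allele frequency rather than on mere presence in dbSNP, and keep anything recurrent in a cancer database. The thresholds, field names and example records are assumptions made for this toy example.

```python
# Toy sketch of a frequency-aware filter: drop variants that are common in
# population databases, but never drop a variant just because it has a dbSNP
# identifier, and keep anything recurrent in a cancer database such as COSMIC.
# All thresholds, field names and example records are illustrative assumptions.

MAX_POPULATION_AF = 0.01   # treat >1% allele frequency as a likely germline polymorphism

variants = [
    {"id": "var1", "population_af": 0.23,  "cosmic_samples": 0},    # common SNP -> filtered out
    {"id": "var2", "population_af": 0.0,   "cosmic_samples": 512},  # recurrent hotspot -> kept
    {"id": "var3", "population_af": 0.004, "cosmic_samples": 0},    # rare, unknown -> kept for review
]

def keep_variant(v: dict) -> bool:
    if v["cosmic_samples"] > 0:              # recurrence in cancer outweighs presence in dbSNP
        return True
    return v["population_af"] <= MAX_POPULATION_AF

kept = [v["id"] for v in variants if keep_variant(v)]
print(kept)    # ['var2', 'var3']
```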
Moving on to COSMIC: COSMIC is a reference set of mutations that have been discovered in cancer genomes, and similar to dbSNP it contains earlier and later submissions, so it is a mixed bag of different studies. Again, when using it as a filter you should pay attention; on the other hand, if you find a really interesting variant both in your data and in COSMIC, you should follow up on it, because recurrence is such a powerful signal in cancer genomics. Go and count how many studies actually found that variant. If it is just a single study or a single patient, it is likely to be less important for your research; if it was found in dozens of patients and many different types of cancer, it is likely to be more important. If you find a mutation that overlaps the COSMIC database, check whether that particular nucleotide has been frequently mutated, or, if the nucleotide itself is not a hotspot, whether the entire gene is frequently mutated. A frequently mutated gene may indicate an important cancer driver gene. However, some genes appear very frequently mutated yet are not cancer genes; an example is the titin gene (TTN), which is enormously long and therefore accumulates many mutations, none of which are statistically significant once gene length is taken into account.

All right, so once we... yes? A question from the audience about the meta-databases: if the reference genome changes over time, are they updating the variants? I wouldn't be able to tell you for sure, but this is an important aspect to pay attention to. If your variants have been annotated against one genome version, you need to make sure that the database you are looking at uses the same reference version. The databases usually come in different builds, so there will be an Ensembl release that uses reference genome GRCh37 and another that uses GRCh38. If your annotation pipeline went through GRCh37, you need to make sure you also use GRCh37 for the follow-up interpretation, otherwise the coordinates won't match at all. Right, any other questions? Another question: just to make sure I understand correctly, if you use dbSNP as a filter you may filter out interesting variants as well, and that is why we should be careful? Exactly. You don't want to throw them out, because treating dbSNP as a repository of purely healthy variation would not be entirely correct; there is also disease-associated and cancer variation in there.

All right, once we have established the frequency of our variants in the healthy population, or whether they have been associated with cancer samples earlier, we definitely want to ask which variants are associated with genes or with the regulatory regions of those genes, and which variants are clearly intergenic and perhaps noise, or at least more difficult to interpret.
When we talk about genes, most of the time we mean protein-coding genes; there are about 20,000 of them in the human genome, and that number fluctuates every once in a while. But there is also a large number of genes that do not code for any protein: non-coding genes that are further divided into classes such as microRNAs, long non-coding RNAs, long intergenic non-coding RNAs and so on. According to the latest high-confidence counts, the number of long non-coding RNAs is similar to the number of protein-coding genes, so we should not underestimate this set of genes. However, they likely differ in functional importance, and the available knowledge is vastly different: much more is known about protein-coding genes and much less about non-coding genes, and some of those non-coding genes are probably artifacts of pervasively transcribed regions that are not very functional in cells.

When we focus on protein-coding genes, they come with different segments or building blocks that are responsible for different things. At both ends of the gene you will see untranslated regions, which are transcribed but not translated. The protein-coding exons are probably the most interesting parts for variant interpretation, because they end up folded into proteins that are active in cells. Introns get spliced out; they are not translated, but they may contain interesting regulatory elements. Splice sites ultimately determine which exons get incorporated into the transcript in a particular cell type and which ones do not. Besides these clearly protein-related elements, other parts of the genome may be relevant for our genes of interest: upstream and downstream regions, in particular promoter regions upstream of the transcription start site, may be interesting for understanding genome variation, but they are much more difficult to interpret in terms of how exactly they affect protein function. And finally, probably the largest share of the variants you detect are intergenic; you cannot really say what they are doing, because they are far away from any genes or non-coding RNAs.
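To make these gene components concrete, here is a toy sketch that classifies a variant position against a simplified transcript model. The gene model is invented, ignores strand, alternative transcripts and splice-site windows, and is only meant to illustrate the categories just listed.

```python
# Toy classification of a variant position against a simplified transcript model.
# The gene model below is hypothetical; real annotators handle strand,
# alternative transcripts and splice-site windows as well.

gene_model = {
    "tx_start": 1000, "tx_end": 5000,                       # transcribed region
    "cds_start": 1500, "cds_end": 4500,                     # translated (protein-coding) part
    "exons": [(1000, 1800), (2500, 2900), (4200, 5000)],    # exon intervals
    "promoter": (500, 999),                                  # assumed 500 bp upstream region
}

def classify(pos: int, g: dict) -> str:
    if g["promoter"][0] <= pos <= g["promoter"][1]:
        return "promoter/upstream"
    if not (g["tx_start"] <= pos <= g["tx_end"]):
        return "intergenic"
    in_exon = any(s <= pos <= e for s, e in g["exons"])
    if not in_exon:
        return "intronic"
    if pos < g["cds_start"]:
        return "5'UTR"
    if pos > g["cds_end"]:
        return "3'UTR"
    return "coding exon"

for p in (700, 1200, 2000, 2600, 4800, 9000):
    print(p, classify(p, gene_model))
```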
All right. When we annotate variants to genes, in many cases it is really simple: there is a gene, or there is no gene, under your variant of interest. But there are situations where it gets a little more complicated, for example a protein-coding gene that partially overlaps a non-coding RNA, and then the question is which information you really care about. When you use the ANNOVAR system, ANNOVAR has a built-in set of priorities for annotating variants. Highest in this priority system are variants that affect protein sequences, so exons, and right after exons and splice sites come the non-coding RNAs: if a variant seems to affect part of a gene that is not protein-coding, the next step is to see whether it overlaps a non-coding RNA. After non-coding RNAs the priority goes to untranslated regions, after untranslated regions come the introns, then the upstream regions, and finally anything that is intergenic. You can also ask ANNOVAR to report all of these potential effects, and this is worth paying attention to: if your field of interest is non-coding RNAs and you use ANNOVAR out of the box, you may lose a lot of information simply because ANNOVAR gives priority to coding sequence. A toy sketch of this precedence logic is shown below.

Here is an example of how that works. In the upper panel you see a protein-coding gene, G1; the taller blocks correspond to exons and the narrower blocks to UTRs, and there is another gene under the latter part of that gene, which is a non-coding RNA gene, NCR1. When we use ANNOVAR to interpret genome variation, anything in orange will be annotated as part of the protein-coding gene: all the orange blocks corresponding to UTRs, exons, splice sites or introns will be associated with the gene of interest if any variants occur in those regions. In the blue areas the situation is a little more complicated, because the blue area corresponds to the non-coding RNA. The orange strip in the middle, which I am trying to mouse over, represents the coding fragment of our gene of interest, and any variants in that region will be assigned to the protein-coding gene. Just next to it there is sequence that does not correspond to anything coding but does contain the non-coding RNA, so variants in that blue area will be annotated to the non-coding RNA. Moving further right on the panel, there is yet another region that is annotated both as a UTR of the protein-coding gene and as the non-coding RNA, and in this case the non-coding RNA takes priority, because a UTR is considered less of an effect than a non-coding RNA. It is worth mentioning again that if you do not want this behavior from ANNOVAR, there is a command-line option that forces ANNOVAR to report every potential effect, making your output a little messier but perhaps more useful as well.

All right, so far we have spoken a lot about annotating variants to genes, but we did not mention that the genes themselves come out of a database, and that may actually make a difference. ANNOVAR uses the RefSeq database, so all the sequences of genes, transcripts and proteins come uniformly from RefSeq, and that is the recommended behavior for this particular software.
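Here is the toy sketch of the precedence logic just described. It only illustrates the idea of keeping the highest-priority overlapping annotation; it is not ANNOVAR's actual implementation, and the labels are simplified.

```python
# Toy sketch of the annotation precedence described above (exonic and splicing
# first, then ncRNA, UTR, intronic, upstream/downstream, intergenic).

PRECEDENCE = ["exonic", "splicing", "ncRNA", "UTR5", "UTR3",
              "intronic", "upstream", "downstream", "intergenic"]
RANK = {label: i for i, label in enumerate(PRECEDENCE)}

def best_annotation(hits: list[str]) -> str:
    """Given all region labels a variant overlaps, keep the highest-priority one."""
    return min(hits, key=lambda label: RANK[label]) if hits else "intergenic"

# A variant falling both in the UTR of a coding gene and inside an overlapping
# non-coding RNA is reported as ncRNA, exactly as in the example figure.
print(best_annotation(["UTR3", "ncRNA"]))       # ncRNA
print(best_annotation(["exonic", "ncRNA"]))     # exonic
print(best_annotation([]))                      # intergenic
```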
However, other databases also provide collections of genes, RNAs, transcripts and proteins, namely the UCSC Genome Browser and Ensembl, which each have their own versions of the genome and of the genes in it. I would not say that one database is clearly better than the others, but what I would certainly recommend is that once you pick one database you use it throughout your entire pipeline, because converting between databases and their identifiers can get messy; you can lose information and create situations that you really want to avoid.

Ultimately, besides mapping our variants of interest to genes and saying that a particular variant is associated with a particular gene, we want to predict the impact of the variant so that we can dig deeper into the biology. In the context of cancer, you would like to find out whether a variant causes an oncogene to become activated, or perhaps causes a tumor suppressor to be downregulated, inhibited or disabled. Generally this is difficult to do, in the sense that most of the genome is very hard to interpret, but the protein-coding genome is somewhat easier; at least a lot of good ideas have been implemented for understanding protein-coding variation. In certain cases in the regulatory genome it may also be possible to say something about the impact of variants. For example, untranslated regions, or UTRs, are known to be regulated by microRNAs, and if we can annotate microRNA binding sites together with our variants of interest, we may be able to say that a particular variant affects a regulatory site in the UTR, or perhaps in the promoter region. However, protein-coding sequences are the first priority in most annotation pipelines, and their effects are easier to chase down because the genetic code, with its codons and stop codons, gives us something concrete to reason about.

So, looking at protein-coding genes and the potential effects variants may have, here is a list of variant types in decreasing order of impact. Probably the most impactful mutation a protein can suffer is an early stop-gain mutation, which introduces a stop codon early in the protein and truncates the rest of it; perhaps only a small tail of the whole protein is left, which in the end is almost a knockout effect. There are studies showing that many individuals carry genes that are truncated in this way, so everyone is a knockout mutant of something. A slightly less impactful mutation is a frameshift insertion or deletion, which shifts the reading frame by one or two nucleotides and leads to faulty translation; sooner or later a premature stop codon emerges, so this too produces a truncated protein. Another impactful event is a mutation affecting a splice site. Splice sites are small motifs near exon borders, and when they are altered, that can determine whether an exon is included in or excluded from a transcript, and you can see how that could affect protein function as well.
Another, slightly less impactful event is an in-frame insertion or deletion, which adds or removes one or more amino acids; the indel has to have a length that is a multiple of three nucleotides. It is harder to say what such a change does to a protein, though it is perhaps easier if it affects well-conserved regions, which we will look at more in the next few slides. Then you can have stop losses, where the final stop codon is replaced by an amino acid codon and translation continues past it, possibly producing a longer protein with a different function. The broadest class is probably the missense SNVs, where one letter of the amino acid alphabet is replaced by another. Finally, the least likely functional mutation is a silent mutation, which does not change the amino acid; yet there are studies suggesting that some silent mutations in protein-coding regions may rewire transcription factor binding sites, so even synonymous mutations can have an impact.

Loss-of-function variants, a few of the classes from the earlier slide, namely stop gain, frameshift and splice-site variants, can be considered highly impactful and highly deleterious for protein function. But even then, before you jump to conclusions and start writing up your paper, it is worth considering additional features of these loss-of-function variants. For example, consider the percentage of the sequence that is altered by the mutation: it can be a stop-gain mutation, but if it occurs very late in the protein, then almost the entire protein is still there and functional, and perhaps the lost tail is a disordered tail, so the protein is not affected that much. Another factor is alternative splicing: there could be a very important stop codon in one exon, but if in the tissue of interest, say the cancer tissue, that exon is never expressed, then it does not matter whether the stop codon is there or not. And splice-site effects are in general quite difficult to predict, because there is a large number of splice sites and we do not know their functions exactly.

So, missense variants being the largest class, and also among the most reliably detected, how can we learn more about any individual missense mutation? We can look at the amino acid alphabet and the physicochemical properties of the amino acids: they are not born equal, they fall into families, and a change across families is more likely to disrupt protein function. A very powerful feature for annotating and analyzing genome variants is conservation: if a particular position in the genome has stayed the same for millions of years and across many species, then it is much more likely to be important for maintaining the healthy state, and variants there are more likely to matter in disease. You can also look at features of proteins, for example protein domains, or the secondary structure, whether a region is structured or disordered. 3D protein structures are available for some but not all proteins, and there are software tools that let you simulate the effect of a missense mutation in a 3D structure, assuming good structural information is available. And then there are tons of other features; some of my work as a postdoc and in the lab has revolved around post-translational modification sites.
These sites are abundant in proteins, and sometimes they are abundant in regions that would otherwise be considered pretty benign; the fact that post-translational modification sites show an enrichment of disease mutations highlights that this is an interesting area of study. Besides looking at all these features separately, there are machine learning tools that integrate information across these different facets of data and give either classifications or scores for individual variants, so there are tools that tell you this variant is more likely to be benign while that other variant is more likely to be harmful for protein function.

Here is an example: BRAF, a very famous oncogene with a hotspot mutation in, I believe, the kinase domain. BRAF V600E is seen in many types of cancer and is a target for cancer therapy, while V600A is another variant that has been observed in somatic and germline genomes but whose effect we are not really sure about. When you look at this chemical map of the amino acids, you can see how you could interpret the function of a missense mutation just from the physicochemical properties: valine and alanine are very similar amino acids, right next to one another, sharing chemical properties, while valine changing to glutamate seems far more disruptive; the molecule has a different shape and a different hydrophobicity, if I am not mistaken, and you can interpret that by imagining how a different molecule could alter the protein structure. A toy sketch of this kind of comparison follows below.

All right, moving on to conservation and the ways various machine learning tools use it to establish which variants are more likely to be harmful than others. Conservation is a powerful and broadly used idea: for a particular nucleotide, or a series of nucleotides or a genomic region, we ask how conserved that region is between human and many related species, and we get at this by analyzing multiple sequence alignments of either proteins or whole genomes. It does make a difference whether you look at the nucleotide sequence or the amino acid sequence, and different tools focus on genomes or on proteins. Some of those tools are the following. PhyloP is based purely on conservation and is quite useful for assessing single variants. PhastCons is a different type of beast: it lets you assess variants but also entire regions, so it is useful for regulatory regions as well as proteins. And a multi-species alignment is something you can visually inspect and study; it is useful for understanding how the sequence has evolved, which parts are more fragile towards variation and which ones are more naturally variable in general. Here is a snapshot from the UCSC Genome Browser: it shows the TP53 gene at the top, which is quite well conserved, and towards the bottom of the screen you can see how the individual protein residues are conserved across species. So when you have your high-confidence variants of interest, you may want to go back to the protein alignments and see what those variants look like in the context of the various species in the evolutionary tree.
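Tying back to the BRAF example above, here is a minimal sketch that compares substitutions by simple physicochemical properties. The hydropathy values are the standard Kyte-Doolittle scale; the chemical "classes" are a simplification for illustration, not a validated impact predictor.

```python
# Toy comparison of amino-acid substitutions by simple physicochemical
# properties (BRAF V600E vs V600A). Hydropathy values are the Kyte-Doolittle
# scale; the class labels are a simplification for illustration only.

HYDROPATHY = {"V": 4.2, "A": 1.8, "E": -3.5}   # Kyte-Doolittle hydropathy
CLASS = {"V": "hydrophobic", "A": "hydrophobic", "E": "negatively charged"}

def describe_substitution(ref: str, alt: str) -> str:
    delta = abs(HYDROPATHY[ref] - HYDROPATHY[alt])
    same_class = CLASS[ref] == CLASS[alt]
    return (f"{ref}->{alt}: hydropathy change {delta:.1f}, "
            f"{'same' if same_class else 'different'} chemical class")

print(describe_substitution("V", "A"))   # small change, same class (conservative)
print(describe_substitution("V", "E"))   # large change, different class (disruptive)
```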
So phyloP is one of these useful scores. It is designed to test whether the nucleotide substitution rate at a particular site or region is faster or slower than expected under neutral drift. You can readily use phyloP in the context of cancer because aligned sequences for human are available, but less so when you focus on an exotic model organism. PhyloP scores are either positive or negative: positive scores can be considered conserved, and you can use a cutoff, such as a value of two, to call regions highly conserved; zero means neutral, and negative scores mean the sequence is evolving faster than the background rate. PhyloP scores are available for different sets of species, which determines how deep your conservation measure is: you can look at only primates, or all vertebrates, or something in between. Yes? (A question from the audience: so if it's a positive number it's more conserved, and if it's a negative number then...?) Right, and one of the main take-home messages about conservation is that it tells you that a position is important, but it does not tell you whether a change at that position is important, or which change is important; you only know that the region matters in general. The additional machine learning tools and scores try to improve on that and provide something more.

So when you use these scoring models to assess variation, keep a few ideas in mind. You can use conservation in different ways: protein-level conservation or DNA-level conservation; one restricts you to protein space, while the other may misinterpret or discount features that are specific to proteins. You can also study the physicochemical properties of amino acids, which is a different angle from conservation, although these may be harder to interpret. When you use these tools, you need to understand which ones are based on a scoring model, some sort of mathematical model, and which are based on machine learning. The machine learning ones are usually more dependent on their training data: machine learning will help you classify or score variants, but it comes with assumptions about how the training datasets were constructed. And the choice of dataset can matter a lot, because different types of disease variants have quite different properties, and training your model on one type of data means that particular type of variant will be discovered more readily. For example, some tools may be better at distinguishing activating mutations, while others may be better at finding inactivating or loss-of-function mutations. And models trained on human Mendelian disorders may not be as valid for discovering cancer mutations, because cancer is a different beast: a driver does not necessarily involve a truncating mutation, it could be activation of a protein at a particular hotspot, and so on.

One of the first algorithms, perhaps the very first of its kind, is called SIFT. It is very broadly used and it is quite old, more than 15 years now. It is designed to find deleterious mutations that disrupt protein function, and it is not a machine learning tool; it is based on a relatively simple score. First you take a query protein of interest, then you use a BLAST-based search to find additional proteins similar to that protein of interest.
Once you have an aligned set of these related proteins, for each position you determine the probability of each amino acid at that position. Essentially you build something like a position weight matrix, as long as the sequence and as tall as the amino acid alphabet, where each entry tells you how probable it is to see a particular amino acid at that position. You then normalize the matrix, and in the end you can read off how likely it is to observe any amino acid at any given position of the protein; a toy sketch of this position-probability idea follows below. In a few case studies they showed that, after normalization, substitutions to amino acids with low probability at a position correlated with deleteriousness in various diseases. So it is a simple score and it makes certain assumptions, but it is robust, because it does not depend on a particular training set of variants used to build the algorithm.

PolyPhen-2 is another algorithm in this family. It is a machine-learning-based tool that integrates several features to predict the effect of a variant on a protein. Sequence-based features are input into the algorithm, as well as structure-based features from protein structures: things like physicochemical properties, protein domains, multiple sequence alignment metrics and so on. A machine learning method, a naive Bayes classifier, then attempts to build the best classifier separating presumed disease variants from neutral variants. They have two training sets. The more stringent one uses only damaging variants from Mendelian disorders as positives, with non-damaging differences between human proteins and closely related primate proteins as negatives. The broader set uses all human disease mutations from the UniProt database as positives, and non-synonymous SNPs observed in the human population as negatives. You can see how such a machine learning method can be very powerful, but it ultimately depends on the input data: if you select a particular kind of disease mutation for the training set, that is what will most likely be captured in your data down the road. The other problem is that there are far fewer well-annotated disease variants than there is individual variation out there, so these training and test sets are likely to be very imbalanced. This is an active area of research; people improve these algorithms all the time and try to reduce the biases inherited from the training and test sets.

Mutation Assessor is another tool in the family. It is not based on machine learning; it is a theoretical, score-based model, somewhat like an enhanced SIFT that incorporates more information. It is based on amino acid conservation, but it specifically attempts to model conservation that is specific to different classes of proteins: proteins are known to evolve at different rates, and that is put into the model in addition to general substitution rates. They use an entropy-based score to predict whether a given substitution is likely or unlikely to be tolerated, and they claim to perform particularly well for recurrent somatic variants, which is exactly what we need when we look at cancer genomes.
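Stepping back to SIFT for a moment, here is the toy sketch of the position-probability idea mentioned above: from a small set of aligned homologous sequences, estimate how probable each amino acid is at each column. The alignment below is invented for illustration; the real method uses homolog searches and more elaborate normalization.

```python
# Toy sketch of the position-probability idea behind SIFT-style scoring.
# The alignment is invented; real tools build it from homolog searches.
from collections import Counter

alignment = [
    "MKVLE",
    "MKVLD",
    "MRVLE",
    "MKVIE",
]

def column_probabilities(msa: list[str]) -> list[dict]:
    n_seqs = len(msa)
    profiles = []
    for col in zip(*msa):                       # iterate alignment columns
        counts = Counter(col)
        profiles.append({aa: c / n_seqs for aa, c in counts.items()})
    return profiles

profile = column_probabilities(alignment)

# A substitution to an amino acid never (or rarely) observed in a column gets a
# low probability and would be flagged as potentially deleterious.
pos = 1                                          # second column (K/R)
for aa in ("K", "R", "W"):
    print(f"P({aa} at position {pos + 1}) = {profile[pos].get(aa, 0.0):.2f}")
```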
Another tool in this family, which is relatively recent, I believe about three or four years old, is CADD. CADD is powerful because it scores variants genome-wide: for every position in the human genome it provides a score for every possible nucleotide change at that position. That allows it to score proteins and also non-coding regions of the genome, but there is a trade-off: it does not incorporate all the rich information we have about proteins, because it is meant to be generally applicable to any region of the genome. It is a machine learning model, a support vector machine, designed to separate harmful variants from non-harmful ones. The negative training set consists of variants observed in humans relative to the inferred human ancestral genome, while the positive training set is based on simulated data; that seems to be deliberate: rather than focusing on a particular set of well-defined harmful elements, they defined a simulation model that generates putatively harmful variants based on our general knowledge of what harmful variants look like. As predictive features they use conservation, of course, but also many other annotations, notably ENCODE tracks describing the regulatory regions of the genome, various tracks from UCSC and so on. They claim to perform a bit better than pure conservation because of all these additional features. I'll just show you a couple of slides on performance: the thick black line that hugs the upper-left corner the most has the largest area under the ROC curve, and that appears to be CADD compared to the other algorithms.

Okay, so far... yes? Can you repeat that? I'm sorry, I don't think I know what you're talking about. Okay, that's interesting, so it produces a combined score over all these various algorithms in a precomputed way. Interesting. I don't know that specific algorithm, but in general it is a very good idea to have several tools predict the outcome; if most of them agree, you are more likely to trust the result. The ANNOVAR pipeline that we go through in the tutorial lets you do the same in a naive way, because it pulls in these various sources of scores and you can more or less add them up and rank your variants by the combined score.

I'm about to wrap up this lecture, but I just wanted to mention a few approaches that differ from conservation, which is the main theme in annotating variants. This is a study from Toronto, published in Science a few years back, predicting splicing regulatory alterations, that is, variants affecting splicing. The goal is to predict how single nucleotide variants affect exon inclusion or exclusion, and they used a machine-learning-based strategy to model how splicing works in general: they looked at sequence motifs in and around splice sites and also at the expression of the various exons in the associated genes. Once that machine learning framework was in place, they could predict what a mutation may be doing to a splice site, based on the motif composition around splice sites as well as the transcription levels of the exons. A feature, or perhaps an advantage, of this framework is that it does not learn from known disease-associated splicing alterations, so it is designed to be more robust and more general than the few examples previously known from the literature. A toy check of the most basic splice-site signal, the canonical GT/AG dinucleotides, is sketched below.
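The sketch below only checks whether a substitution destroys the essential GT/AG dinucleotides at the intron ends; real splicing predictors, including the study just mentioned, model much longer motifs and exon expression. The intron sequence and coordinates are invented for illustration.

```python
# Toy check of the most basic splice-site signal: the canonical GT at the
# intron's 5' (donor) end and AG at its 3' (acceptor) end. The intron sequence
# below is invented for illustration.

intron = "GTAAGT" + "T" * 20 + "TTTCAG"     # starts with donor GT, ends with acceptor AG

def disrupts_canonical_site(intron_seq: str, offset: int, alt_base: str) -> bool:
    """Return True if substituting alt_base at `offset` (0-based within the
    intron) changes one of the canonical GT/AG dinucleotides."""
    mutated = intron_seq[:offset] + alt_base + intron_seq[offset + 1:]
    donor_broken = mutated[:2] != "GT"
    acceptor_broken = mutated[-2:] != "AG"
    return donor_broken or acceptor_broken

print(disrupts_canonical_site(intron, 1, "C"))   # True: GT -> GC, donor site lost
print(disrupts_canonical_site(intron, 10, "C"))  # False: deep intronic change
```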
Another way to interpret cancer variation in the context of proteins is through phosphorylation and other post-translational modifications, which are very abundant in proteins in general. Post-translational modifications act like activating or disabling switches; they extend protein function. A large number of them are known from experimental evidence, mostly mass spectrometry data: about 130,000 sites in proteins, covering roughly 12% of human protein sequence. It turns out that many of these sites are enriched in disease mutations, especially in cancer, and they show less variation in the general population, suggesting that these sites are important to maintain without variation. What is important in the context of this lecture is the figure here, which shows how known disease mutations would be classified by some of the tools we discussed earlier. Black represents mutations predicted to be damaging, which is expected, since these are disease mutations. However, orange and red show disease mutations that are predicted to be benign by these various pipelines and yet affect post-translational modification sites. This suggests that if you include post-translational modification site annotations in your analysis of variants, you may uncover information that tools like PolyPhen and SIFT would not show you. The reason is the following: post-translational modification sites are often not conserved, because they sit in protein regions called disordered regions, and therefore they are less likely to be called harmful by tools that rely heavily on conservation. So having this additional layer of signaling-site annotation may help you interpret some variants that would otherwise look benign.

To conclude this lecture: the main feature in many of these variant annotation tools is conservation, and a simple measure of conservation such as the phyloP score is quite powerful in many cases; it performs not quite as well as the complex scores, but relatively well, and it is very general. When you analyze your data using conservation, it is always a good idea to go back to the multiple genome alignments or multiple protein alignments and see what the region actually looks like. The machine-learning-based models that distinguish harmful from benign variants can be really powerful, but you need to understand their weaknesses, in particular that they were ultimately trained either on known disease variants, which are a fairly sparse dataset, or on simulated data, and simulation is always based on assumptions. And when you have your variants of interest, especially further down the pipeline when you have a set of really high-confidence variants, you should also work through them case by case alongside your bioinformatics tools: consider the conservation, look at the various scores, see how well the different machine learning models agree in their predictions, and review the amino acid changes and their physicochemical properties to think about how they could affect protein function.
Sometimes you also have access to three-dimensional structures in which you can simulate how the variant potentially affects the protein structure, and so on. And don't forget to think about the variant in the context of its gene and the networks and pathways that gene is involved in; that is the theme of the next lecture, after the tutorial.