All right, so I am Daniela Mejiko. I'm from SickKids Hospital in downtown Toronto, and I'm lucky enough to be the manager of the Informatics Facility at The Centre for Applied Genomics, which provides a range of data analysis services for NGS data and array data. This talk is distilled from our experience sifting through variants and trying to capture which ones are more important, which we do more on the research side of our activities. So here I have a nice capture of a few of the tools and ideas that we will go through.

Objectives for today. So, I have detected somatic variants in a cancer sample, or, in your case, in a cell line that was originally derived from a cancer. What information can I bring in from the world to interpret them and narrow them down to a set that is interesting and potentially biologically actionable? Specifically, we will go through what annotations are available, what they mean, and how we use them. Then we'll have a specific part on the models used to score missense variants. In the lab we will see an annotation tool in action. The name of the tool is ANNOVAR, which is pretty popular in the community. We'll take the variants that you generated with Strelka, annotate them using ANNOVAR, filter them, and get from so many variants down to a few genes that are potentially actionable.

Right, so a little bit of introduction first. These are concepts that you may have seen before, and if I'm telling you something different from the other instructors, raise your hand, because we should at least agree on the basics. This is more about the design of cancer studies and how you go about interpreting the data, and I've identified two cases: small datasets and large datasets. In small datasets you only have a few subjects or a few cancers that you have sequenced, and in large datasets you have many. So I said greater than a hundred for large, and on the scale of one to ten for small.
Now, if you have a few, the focus will be more on looking at variants that were previously reported and then trying to interpret them based on variant information, gene information and so forth. This is more similar to what a clinician may do with sequence data for a specific patient who comes in, using the sequence data as a sort of diagnostic tool, right? Or you may have a small study on a specific cancer that's not that common, so you want to try to squeeze as much as possible out of its biology. But you also have these large datasets, okay, big consortia doing so many samples of colorectal cancer, medulloblastoma and so forth. There, statistics play a much bigger role: you can look at variant frequencies over so many samples, aggregate statistics from the variant to the gene to the pathway, and then rely more on the statistics to identify the potentially interesting variants. You should never forget the biology, of course, because eventually that has to tie into a message about what's going on in the cell. So, if you wish, what we see today is useful and helpful for large datasets too, but it's not the centerpiece there. And I will not talk about how you do association statistics at the variant, gene or pathway level, okay? I will focus more on how you bring in as much information as possible on the variant in order to interpret it.

The other big distinction you should keep in mind is between variant and gene information, okay? So, two levels. The lowest one is the variant, which may be a single nucleotide variant, a small indel, a deletion, a big copy number change, okay? Especially for the smaller variants, which are the ones we will focus on, depending on where they are, what type of effect they may have on protein function and so forth, we can classify them and prioritize one over another.
But the other level that you should always keep in mind is that the same disruptive variant, one that may knock out one copy of the gene product, will not have the same biological effect on one gene as on a completely different gene. So there's a lot of information at the gene level that you also have to use when you prioritize variants. I will not go into the details on this. You will hear more in the lecture that follows mine, especially about gene functions and pathways.

I think you've already heard the words passengers and drivers, and here I'm just illustrating how this plays out with what I said before. Say you have a gene that's important: the gene is an activator that controls a key process. On top of that you have a variant that puts it in a permanently active state because it's on a key residue. Then that combination can act as a cancer driver, but you need both pieces, variant level and gene level. You can have an activating variant on a gene that's not even expressed in that cell and that system, and that will not act as a cancer driver. The other case, of course, is that you have an important repressor and a loss-of-function variant, so the protein is not expressed anymore or it's in a permanently inactive state, and then you get a driver event. Again, you need the gene and the variant. And you can get any sort of variant in genes that are redundant, or not expressed, or that control a process that's not key for cancer. Then, regardless of the variant, even if it's in the most conserved residue, you get no effect. So never forget the gene level, although we don't cover it in this lecture. Okay, this lecture will be more about variants. Also, if you get a variant in a region that's not important, even if the gene is important, you get no effect, of course. So this is just to summarize the three areas of information that I pointed out.
Variant frequency: not today. The variant itself will be covered more in this module, and gene function and pathways more in the next lecture.

Okay, the other thing to keep in mind is variant size. You've probably already discussed this, because you've looked at somatic SNVs and at somatic copy number variants. This is just a summary of the different variant sizes. We're going to focus on the small, or short, group, which is from roughly 1 base pair to 50 base pairs, or from zero, depending on how you count the deletions. So we will focus on SNVs, single nucleotide variants: one base pair, a substitution, relatively easy to detect. And small indels: a bit more challenging to detect, but never bigger than 50 base pairs, and typically smaller. There are plenty of databases for this kind of variant, reporting them along with any link to disease, publications and so forth. These are typically mapped by exact coordinates, which is what we will do. But you've also seen other types of variants in a larger size range: again insertions and deletions, translocations, complex rearrangements, gene fusions and so forth. What we see today is not fully applicable to these, although some ideas are in common, because of the size, because the mapping cannot be by precise coordinates, and because of the type of effect. For instance, an SNV can change a residue within a protein, but a big copy number variant is not going to change a residue in a protein. So much of the logic is a bit different.

Right, so before we dive in, any questions? All clear? This is a summary of what we will go through in detail today. These are all the variant annotation components we will go through. First of all, we will look at databases of variants. The first group will be databases with allele frequencies from reference datasets. They don't have anything to do with cancer, and they typically contain germline variants.
That is, variants that are present in all the cells of an individual, and that don't arise somatically in cancer cells or other cells. The names of these databases, which somebody may have already mentioned, are the 1000 Genomes, the NHLBI ESP, and the CGI 46. We will go through in detail what those are and what the specifics of those data are. How are we going to use this? Well, if these report the frequencies of germline variants and we're looking at somatic variants, somatic variants should not show up. So we're basically going to use them to throw away variants that are potentially false positives: not somatic, but actually germline. Then we have dbSNP. It's a broad-scope reference database for small variants. So if something has been reported at some point, polymorphism, germline, somatic, whatever, it may be there. And then COSMIC, which is specifically a database for somatic variants, so the kind of variants that you're interested in, this being a cancer workshop. These are all databases that collect variants, with some level of annotation and display.

The other piece, after we've checked whether our variant is in any of these databases and whether there's usable information that we can pull out, is gene mapping. The human genome has 20,000 or more genes. We know a lot about the function of genes, especially the protein-coding ones, so these are the key features to look at when you look at the genome. So gene mapping is a centerpiece. Then, depending on the different portions that we have within a gene, we can map to those portions: for instance, the coding region that codes for a protein or another gene product, the UTRs, which are transcribed but not translated, and then what's outside of genes. After we've done the gene mapping, we can check what type of effect we have on the protein. Does this alter the protein sequence? Does it change an amino acid? Does it disrupt the translation of the protein from a given point onward?
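As a rough illustration of the protein-level questions just listed, here is a minimal Python sketch that classifies a single-codon substitution using the standard genetic code. The function name `classify_substitution` and the category labels are illustrative assumptions, not the output vocabulary of any particular annotation tool, and the sketch ignores strand, transcript choice and indels.

```python
from itertools import product

# Standard genetic code, written compactly: codons are enumerated in
# T, C, A, G order and paired with the matching one-letter amino acids
# ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon change by its effect on the encoded amino acid.

    The labels returned here are hypothetical names chosen for this sketch.
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"   # protein sequence unchanged
    if alt_aa == "*":
        return "stop-gain"    # translation is cut short from this point on
    if ref_aa == "*":
        return "stop-loss"    # translation reads through the original stop
    return "missense"         # one amino acid replaced by another
```

For example, `classify_substitution("GTG", "GAG")` reports a missense change (valine to glutamic acid), while `classify_substitution("TGG", "TGA")` reports a stop-gain.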
Then we will look specifically at how we interpret a class of variants called missense variants, which change amino acids. What type of measures can we use to further interpret them? There are other variants that are easier to interpret, whereas missense variants are more difficult. They can be very important, but there can be a lot of them and it can be difficult to see through them. If you don't have any tool, they're just a lot of variants that don't apparently mean anything. So that's why we have to spend more time looking at models to score them.

Right, so I'm going to start from the allele frequency databases. As I said, all these databases are built with the following logic. You recruit a number of individuals, take blood samples or saliva samples or cell samples, sequence them and detect variants, and then you get a reference dataset, on a relatively large population, of what the variants are. This design is aimed at germline variants, so variants that are present from birth in all your cells. It's possible that some somatic variant may come up, but that's going to be a very rare event. So they're not designed to identify cancer driver variants. They're designed just to look at the general variation in the population, with different flavors. So suppose we see a variant coming up in these databases, a specific allele with a given frequency. You have detected a somatic SNV, but look: that allele, that T, that base substitution, actually has a 1% frequency in one of these databases. It's likely an artifact; it's likely not somatic. And if it is somatic, it's likely to do nothing, because otherwise you wouldn't see it in so many subjects. So we'll use them as a filter to throw away stuff.

I've already gone through the names. Let's start from the 1000 Genomes. The reason this project was started was to identify variants that are relatively common in populations, specifically greater than 1% frequency, in different ethnic populations.
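The filtering logic described above (discard a putative somatic call when its allele shows up in a germline reference panel at an appreciable frequency) can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the record layout, the `pop_freq` field, the panel keys and the 1% default threshold are all hypothetical choices for illustration, not the format produced by any specific tool.

```python
from typing import Dict, List, Optional


def max_population_frequency(pop_freq: Dict[str, Optional[float]]) -> float:
    """Highest allele frequency seen in any reference panel.

    A value of None means the allele was not reported in that panel;
    an allele absent from every panel gets a frequency of 0.0.
    """
    observed = [f for f in pop_freq.values() if f is not None]
    return max(observed, default=0.0)


def drop_likely_germline(variants: List[dict], threshold: float = 0.01) -> List[dict]:
    """Keep only calls whose allele stays below `threshold` in every panel."""
    return [
        v for v in variants
        if max_population_frequency(v["pop_freq"]) < threshold
    ]


# Hypothetical somatic calls annotated with frequencies from two panels.
calls = [
    {"id": "var1", "pop_freq": {"panel_a": None, "panel_b": None}},
    {"id": "var2", "pop_freq": {"panel_a": 0.012, "panel_b": None}},
    {"id": "var3", "pop_freq": {"panel_a": 0.0004, "panel_b": 0.0009}},
]
kept = drop_likely_germline(calls)
```

Taking the maximum across panels is a deliberately conservative choice: a call is dropped as soon as any one reference population carries the allele at or above the threshold, which matters when a variant is common in only one ethnicity.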
Right now there are about 1,000 subjects sequenced and available, and at project completion there would be 2,500. Is that enough? Well, for that goal, this is a good number. But of course if we had a million instead of 1,000 we would be much happier, especially for variants that are more unique to specific ethnicities. It was launched in 2007, and sequencing technology has been progressing really, really fast. So some elements of the design don't make much sense if you look at them nowadays, but you have to think back to the technology that was available at the time. Here I will drill down into a few details, but I've already stated the key concepts, so if you get lost in some of these details it's not a big deal.

The ethnicities are actually pretty good. There's European, Black African, East Asian, mixed American. Some ethnicities are missing, but this is one of the best databases in terms of ethnic coverage. Why is ethnicity so important, by the way? Well, if I have a variant that's there but only in one ethnicity, and for some reason I mistake it for a somatic one, then if that ethnicity is not well represented I will not have a good chance of removing it. So suppose your somatic caller is not doing a good job and is making a lot of mistakes, calling a lot of variants somatic when they're not. If you're sequencing a poorly represented ethnicity, this database helps you less. If you're sequencing a more mainstream, well-represented ethnicity, this database will do a better job.

Right now there are 38 million SNPs and about four million indels, so it has grown very rich. It has a mix of different platforms, so it's not unique to one sequencing platform, and it's a mix of low-coverage whole genome and higher-coverage exome. So where the variants were called is also a hodgepodge: almost everywhere in the genome where you can align reads and call variants, with enrichment for the coding component through exome capture.
So in this course we will focus on the coding component, but if you look outside it, keep in mind that you have less support there, although you can still get reasonable frequencies. They also use a mix of different tools for calling variants. I mean, is it a disaster? No. But there are sometimes specific effects related to using one platform or one caller. If the database was built with the same platform and the same caller that you're using, you have a better chance of seeing the same things, or the same artifacts, and so a better chance of removing them. If you're using a completely different platform, you may see something unique and think it's new, but it's just unique to your platform or your pipeline. That's why I'm also mentioning it for reference, but as I said, it's not a disaster. This is a really neat table showing you all the ethnicities that are present, but it boils down to Europeans, East Asians, Africans, with a specific focus on West Africans, and Mexicans available right now, with others planned, such as Indians, other Southeast Asians and so on.

The next database that we review is the NHLBI ESP, okay. The goal of this project was not to chart the variation in the general population focusing on healthy individuals, like the 1000 Genomes. The goal was to look at disorders that are relatively common in the population, focusing on heart, lung and blood, right, and to be able to tackle variants that are more rare, with allele frequencies below 1%, okay. It has 6,500 subjects, so more than five times as many as the 1000 Genomes, but it uses exome technology. So it's very good for the coding regions that were captured; outside of those it can't tell you anything, unless for some reason part of the introns was captured and so on, okay.
These are not necessarily healthy individuals. For you it's not a problem, because this doesn't capture somatic variants in cancer: it captures people with extreme blood pressure, people predisposed to myocardial infarction and so forth. So you shouldn't see somatic variants here, because it's not a cancer dataset. As for the ethnicities, well, East Asians are completely missing, so if you have a variant that's unique to East Asians you're not going to see it here. Right now it's focused on African Americans and European Americans, so it covers northwestern European Caucasians and Black Africans very well. The platform is again Illumina, and for variant calling it's again a bit of a hodgepodge of different methods.

The last ones that we'll see are the Complete Genomics 46 and 69, which are highly related. 46 is the one with unrelated subjects, so it's a subset; focus on this one. It's not a lot of subjects, okay, but they're whole genome and they're on the Complete Genomics platform. The Complete Genomics platform was probably mentioned to you. It's a very nice platform, and it comes with sequence data analyzed up to the variants. Maybe it's not as widespread as Illumina, but it's a very good platform, so it's very helpful to have a database that's specific to it. Again, it's not a lot of subjects, although the ethnic coverage is very good, right, and the sequencing coverage is also very high, unlike the 1000 Genomes. And this is just a detail of the different ethnicities for these 46 subjects, okay.

In the final lab we'll use allele frequencies from all of these databases. So, just a summary of the key things to keep in mind: different ethnic compositions; whole genome versus exome, so if you're looking at variants that are intergenic, for instance, NHLBI ESP will typically not tell you anything; different platforms; and different variant callers, which is a minor issue, although if you have Complete Genomics data you may see your variants in the CGI 46, for instance. And
another, more detailed point is that there are different sequencing depths. So different databases have different power for variants at different frequencies: some of them are designed for the more common ones, and some of them are designed for the more rare ones, like the NHLBI ESP. So of all these datasets, you should essentially use all three. The other thing is that these projects get constantly updated. Variant calling pipelines get updated, more subjects get sequenced, even the exome capture may get updated, and some projects may switch from exome to whole genome. So you always have to keep your eyes open, stay tuned for updates, and look at how the updates are captured by the annotation tool that you're using.

Right, do you have any questions so far? When I say allele frequency, is it a concept that's shared by everybody? Okay.

The next components that we're going to look at are the sequence variation databases, dbSNP and COSMIC. Let's start from dbSNP, which you may have already heard a lot about. This is a broad-scope repository of variation, and we look at the small-variant component specifically, okay. There are other databases at NCBI that are specific to bigger copy number variants, like dbVar. Submissions came in both from the era before NGS and after, from all these projects driven by NGS. Whereas the projects that I showed you before, 1000 Genomes, NHLBI ESP and CGI 46, are all based on NGS technology, here there will be submissions that date back to Sanger technology or even before, okay. There are polymorphisms that you find in the general population with even higher frequencies, and there are things that are more rare. There are somatic variants. There are variants that might not even validate: maybe something was just submitted and it's a bit more iffy. So it's basically a reference database of variation that has surfaced in different studies, and you will find a lot of different things in it. So don't use it as a filter. If you throw away everything
that's in dbSNP, you may throw away very well characterized somatic variants, right, which have tons of publications attached. If you really want to get into the plumbing of annotation, there are actually ways of extracting from dbSNP only the component that's not clinically associated, if you want to use it as a filter like we did for the other databases. We will not do that in the lab today, but I want to mention that it's possible.

The other reference database that we look at is COSMIC, which was probably mentioned before in this cancer bioinformatics workshop. It's the Catalogue of Somatic Mutations in Cancer, so that's what you want to look at if you want to see whether your somatic variant has been reported before. We're going to do the matching to COSMIC entries, and really the key here is to check how many studies and samples it was found in. If it was found in only one other, you can imagine the support is not that great. If it was found in a thousand others, well, you haven't discovered a new variant, but if you're just interpreting what's going on at the biological level, then you can rest assured that the variant is interesting, right. And instead of just looking at the single-variant level, you can look at the gene: look at how the variants are distributed and whether there are hotspots, and whether there are a lot of variants reported for the gene. Some things will not surprise you: genes like TP53 will have tons of entries, okay.

Here there's a little tutorial on following one variant through these two databases. This is not a lab session, so don't start clicking on your laptop, just follow along. But you have enough details that, if you want to look at this later, you can follow through all the screenshots. I picked this variant, BRAF V600E, as an example, just to be hands-on with one variant. This is very well established in the cancer field. It's actually an activating variant on BRAF, which is in the MAP kinase cascade, a driver of
growth, proliferation, survival and so forth, okay. We look at it in dbSNP and COSMIC. I've already mapped it for you to the dbSNP identifier, right, because otherwise you only have the name of the gene and the position in the protein where you have an amino acid change. That's not the easiest way to identify a variant, although you can use those in COSMIC, whereas dbSNP typically uses rs identifiers. If you have somatic variants that were called by yourself or by somebody else, you start from the genomic coordinates, and the gene and the type of amino acid change are an output of the annotation pipeline. So you can easily trace back to the genomic coordinates and the dbSNP identifier, which, as we'll see in the lab, you'll have in the ANNOVAR output. So I'm not doing any magic, just taking the dbSNP identifier as a start.

This is the screenshot that you get out of dbSNP. It has a lot of information, but one key thing that we can start focusing on is the fact that you have an area called clinical channel here, and this is marked up as having pathogenic and untested alleles. So what happens at this position? You have the reference, and then you have two alternate alleles: one of them is kind of a passenger, and the other one is pathogenic. This one is actually the pathogenic one; I'll show you better in the next screenshot. But the key message here is that when you put your rs ID in the dbSNP page and this shows up, it's clinically associated. So it's a variant that has been reported to be associated, to some extent, with disease, so look for extra information and see how well it's supported. Then if you click on that, you get to a more detailed view of this clinical association report, okay. The key here is that at this position we have T as the reference allele, and then there are two substitutions that are reported. The pathogenic one is T to A, which produces the V to E amino acid change at position 600, and then the other one is maybe somatic, maybe
germline; its pathogenicity is untested, and actually it's a T to C, which produces a valine-to-alanine change. Intuitively, if you look at the table of amino acid structures, valine to alanine is not a big structural change, but valine to glutamic acid adds a negatively charged group. So we see that the biochemistry actually supports what we've seen in the database as reported based on association studies, functional studies and so forth. This is a case where you immediately see, just by looking at the amino acid level, that the distinction between the pathogenic and the untested allele makes sense.

Then if you click on the pathogenic one and follow that link, you get to another database that I don't have time to describe in detail today, which is called OMIM. OMIM will give you a lot of information about this variant, but I pulled out the key sentence, which says that the V600E mutation is an activating mutation resulting in constitutive activation of BRAF and downstream signal transduction in the MAP kinase pathway. Malignant melanoma, boom, and there are many other entries, right. So this is a very good example of a driver mutation in a driver gene: the mutation is activating, and the gene is an activator of the process, okay. So it's not your TP53 loss of function that removes an important repressor or DNA damage response activator.

Then in COSMIC I've queried this starting from the gene, putting in BRAF, and this V600E is one of the first records that you get. Apologies for the low resolution on the screen. Then you can look at the COSMIC entry, and specifically you can see that there are a lot of entries, so a lot of studies and samples that have reported it. You can also see that there's a gigantic column at that position representing all the studies. And if you look at other variants, there's really a cluster of somatic variation besides the single nucleotide substitution at the same position, and then a few others in the same protein domain. So this is really a key
residue. In general, even if you have something that's only reported by a few studies, if you then see clustering in a protein domain, that's also very interesting, okay. So this is the approach of taking a variant and looking it up one by one. We'll be more systematic in the lab, of course, just taking a big table and then applying a number of filters. This is just helpful to illustrate how we can take the databases and browse them, and what type of information they hold. Of course, that information can also be extracted computationally, at least some of it, okay.

The next thing that we're going to go through is gene mapping, but before that, do you have any questions? Yes? Yeah, that's a good rule of thumb, okay. But proteins are complex structures, and depending on where things are, the outcome may be very different. If we had protein structures for all the proteins and all the protein isoforms, and we had a good compute cluster doing molecular mechanics, we could go in, change an amino acid, and see what happens, assuming that there are no other effects due to that protein binding to other proteins and so forth, which really requires putting the protein in a context. Then we would see that some residues are in key positions: very important for the secondary structure, like for an alpha helix or for a beta sheet, or a catalytic residue, or important for forming a hydrophobic core. We change the residue, and that destabilizes the protein or changes its catalytic capability, so it decreases the rate of the reaction, right. Unfortunately, we don't have that many protein structures, or tools that are advanced enough to really do this mechanistic prediction. But usually, if a residue is conserved, and that's a notion that I will describe better later, and you have a dramatic amino acid change, chances are that the change will be disruptive, although that's based on heuristics, right. The missense scoring models specifically look at the type of amino acid change and use different scoring models to basically
predict how disruptive that's going to be. But of course it's always good to have a mental model, so that you look at the amino acid and you see what it does. Sometimes you can even tie it into biochemical publications that say: we've done extensive characterization and can tell you in detail what the effect is. But usually it's those dramatic changes: a cysteine gets changed, or a proline gets inserted or changed to something else, because proline has a particular effect on the structure, or there's a complete polarity change from positive to negative. In general those are more likely to make a difference. If you have a valine to glycine, or a leucine to isoleucine, sometimes that will make a change, but it's less likely, right. Many people at TCAG in the academic group have that table printed on their desk, so it's useful. All right, I usually pull it up from Wikipedia; I should print it myself. It's on Wikipedia, by the way, so you can just look up amino acid there.

So the other thing is the gene mapping, okay, which we're going to go through next. You may be picky and say, well, why only genes, right? Somebody will say it's already complicated enough, but somebody else may be very ambitious and say: I don't want genes only, I want everything, right? We're discovering so many things about the genome, like there's ENCODE. Well, besides the fact that we have only two hours today, genes are really the key thing, right. The centerpiece of our understanding of the genome are the genes, and the very well characterized ones are the protein-coding ones, okay. So that's the starting point if you want to annotate variants with respect to what functional element in the genome they overlap with and potentially disrupt. I mean, of course there are other interesting things that are biologically active. There are regulatory sequences where transcription factors bind, where chromatin structure modifying factors bind, where other factors bind,
or there's non-coding RNA that may be only partially characterized, and other sequences that have a structural role and are important as spacers. But our understanding and characterization of those elements is not at a stage where we can use them as much as we use genes, right. So the focus is primarily on genes for that reason.

And what are the types of genes? Well, as I said before, protein-coding ones code for proteins. Proteins are very different from nucleic acids, right: they're made of amino acids, have a 3D structure and so forth. We know them very well; there are lots of biochemistry papers and books on proteins. Then there are non-coding genes, which in this case are better called non-protein-coding, and usually the gene product is an RNA. So you don't eventually produce a protein; you stop at the RNA level. An example of a family that I think is very well characterized at this stage, so you can look at those, is the microRNAs, which are very short RNAs that regulate transcript stability or translatability. So they're active regulators at the transcript level and the translation level. They're very short, and they have a seed that's even shorter, on the order of 8 to 12 nucleotides if I remember correctly, and that seed is highly conserved, okay. So this is an example of a non-protein-coding gene that's worth looking at because it's well understood, but there are others that are still in the process of being discovered, characterized and so forth. So I'm going to focus on the protein-coding genes. And then, of course, the other distinction that we mentioned before is different functional relevance. Some of these RNA-coding genes may be well characterized and have an important role; some of them may be more iffy, they may be more redundant, maybe not strictly required for proper function.

The other piece is mapping to different parts of a gene, and here we're using a relatively simple breakdown into parts. We have the UTRs, which are transcribed but not translated sequences. Then we have the
translated coding exons, which are the ones that have a one-to-one, well, not one-to-one, three-to-one mapping to the protein sequence from the start codon to the stop codon, okay: three nucleotides, one amino acid. These are also spliced in: you have the pre-mRNA, and then it matures through the splicing process. Some of them may be spliced out, but typically they're spliced in. Then we have the introns, which do not code for amino acids, might have regulatory sequences, and are spliced out. Then we have the splice sites, which are these small sites around the intron-exon junctions that drive the splicing process, okay. This is the breakdown that we will use, and I have some graphics later. Then we'll have upstream and downstream of the transcribed portion of the gene, and when we get really far we have intergenic. ANNOVAR will give you the distance to the two closest genes. When you're really far, the variant may affect those genes or not; it's difficult to tell. So it's a very gene-centric view, of course.

So I'm going to anticipate this slide. This is a representation that you will see in the UCSC browser, a bit simplified. Transcription starts here. You have the UTR, then a coding exon, a big intron, another coding exon, a small intron, another coding exon, another coding exon, and the UTR. This is where the splicing happens, okay. Then, overlapping this gene, which is protein coding, I placed here a small non-protein-coding gene, a non-coding RNA, that's shorter and of course doesn't have any coding region, because it cannot code for proteins. So it's all represented with the same markup, okay.

The tool that we're going to use, ANNOVAR, has to make decisions when we have overlapping things, right. If things do not overlap, well, we have that categorization and everything gets assigned to one category. Unfortunately, in some places of the genome we have overlaps, right, especially now that more non-protein-coding RNAs are getting discovered. Some of them may be iffy, but they end up overlapping with protein-coding genes. So
either the annotation tool will tell you everything, "this overlaps this, this, this and that", which may confuse you or give you too much information, or it has to make choices and focus on the more important stuff, right? ANNOVAR prefers to focus on the more important stuff, and we need to be aware of what ANNOVAR believes to be more important. That's why this slide is about ANNOVAR's priority system, which is unique to the tool in a sense, but these kinds of decisions may be made by other tools too; it's a general strategy, so it's worth keeping in mind. This is specifically to resolve overlaps between different categories. Number one is exonic and splicing, right? Because if you alter a coding exon you can alter the protein sequence, and if you alter a splice site you can alter the splicing and then alter the protein sequence. Then non-protein-coding RNA, ncRNA, comes number two. So in the case we had before, a coding exon overlapping a non-coding RNA, the coding exon gets priority and you will not know anything about the non-coding RNA. And then as we walk down we have the two UTRs, three prime and five prime, introns, upstream and downstream, and intergenic, right? Biologically this overall makes sense; maybe for some non-coding RNAs you want to keep them in mind anyway. And then I have a slide here with the breakdown that ANNOVAR will do for the case that we looked at before. We'll have upstream here, five prime UTR here, exonic, which means exonic protein-coding, here. Then we have plus two, minus two around the exon-intron junction as splicing: in particular, the intronic part will be splicing and the exonic part will be exonic splicing. Then we have the intron, and so forth. I didn't plot the splice sites again, just not to have too much clutter. And then see here: when you have an intron and a non-coding RNA overlapping, it will go to non-coding RNA, but when you have a protein-coding exon and a non-coding RNA overlapping, it will go to the protein-coding exon, and so forth. So here I've taken the
ANNOVAR output with respect to the gene part and gene mapping, okay? And then I've added the screenshot from UCSC mapping that variant and then zooming out. This is from the annotated variants that we will be making in the lab, right? So this is an example of something that's intergenic. You can see that the closest genes are at the two extremes. You can see the same notation that I used before for the genes, with the very thin lines representing introns, then thicker lines representing UTRs, then bigger blocks representing coding exons, and ANNOVAR gives you the distance from the two. Then you can see upstream of the transcription start site of the gene. Then you can see a UTR: this is the three prime UTR, so at the end of the gene; the gene is transcribed in this direction. And then you can see a five prime UTR, in this direction, so it's at the very beginning of the gene. It does overlap the intron of another isoform of the same gene, so here we start seeing something a little bit more complex. We have one gene, that's KCNAB2, but that gene has multiple transcript isoforms, so there are different start sites that are reported. This is a shorter isoform, so it starts from here with its own UTR, and this UTR actually overlaps the intron of other transcript isoforms. ANNOVAR is making a choice, telling you it overlaps the UTR and not telling you that it's overlapping the intron. Of course, in that field the gene is the same; it's always KCNAB2. Then we have the typical exonic, protein-coding case, and then we have a splicing one. This splicing one I wanted to elaborate on a little bit. If you look at the nucleotide sequence, which unfortunately in this screen view is at lower resolution, you would see that here the nucleotides are AG. AG is the canonical sequence for an acceptor site on the intronic side, and this is changing the G. If you look at that in different species, that site is actually very conserved, so changing the G may actually have an impact on the splicing, but functionally you have to
check what happens; this is just tipping you off that it may be interesting, okay? So splice sites are something really not to oversimplify. ANNOVAR has a relatively simplistic look at them: it's just giving you the plus or minus two around the intron-exon junction. The intronic side is far more important. You have to keep in mind how well conserved they are, because if you see a lot of divergence it's less likely to really have an effect. If it's on the exonic side, it's less likely to have an effect. So splicing variants are really tipping you off about the fact that they may alter the splicing, but don't take it as truth with a capital T, right? If you see a case where it's on the intronic side, it's an AG, the AG gets disrupted, the AG is well conserved, and there is no other AG in proximity, that's good evidence. Other cases are more iffy, right? In this case I pulled out, again from UCSC, the 46-way multi-species alignment: it's always AG, right? Unless you have a gap where the sequence is not there. So this is an AG that's probably functionally important. So we've already learned also about the conservation criterion, which I will review in more detail later. And last but not least, what database did I use for this gene and gene part annotation? My strong preference, and our group's strong preference, is for RefSeq, because it's pretty conservative, meaning that it will give you reasonable transcript isoforms, whereas we find Ensembl to be a bit less conservative, so it will also give you isoforms that we don't always believe. UCSC Known Genes is also pretty good. All these are available within ANNOVAR; if you want others you'll have to make the database yourself, but you can follow the instructions to do it. If you have a cancer sample, you start looking in the coding regions, and you really focus on the variants that are in this sense easier to interpret, RefSeq will do a great job. If you're really trying to scrape anything possible out of a sample, then you may want to try other databases. That's sort of the take-home
message. And then of course I didn't talk about other annotation features: within protein-coding genes, protein domains; outside, or in proximity, or overlapping UTRs, the ENCODE project has produced a lot of profiles for epigenetic marks like histone marks, where a histone residue is changed and then the histone binds to that site in the genome, so if you lose a binding site that may alter the chromatin regulation. So in fact you have the mark on a protein mapped to the genome, the binding site in the genome of that protein. And then DNA methylation, which of course is directly on the DNA, in the CpG islands and elsewhere, and usually has a repressive effect. Again, if you completely lose a site or alter it in a relevant way you may have an effect, and so forth. But these are more difficult to interpret. If you have something that wipes out more than half of a protein, that is much easier to interpret than this. If you have something that disrupts a canonical acceptor site AG, even that is easier to interpret than this. So that's why I left it out, besides the fact that we only have two hours. But be aware that these things are all visualizable in UCSC, so you can go to UCSC and say "show me everything", but it will be more difficult to assess what is wrong. So, keeping the pace, the next thing that we look at is the gene product effect type, right? I think this was already, at least to some extent, addressed in previous lectures. Here we're really going to hone in on the effect on the protein sequence for protein-coding genes. I mean, for this type of gene product effect there may be models that do it also for other categories of RNAs, but again I didn't have enough time. For instance, if you disrupt the seed of a microRNA you may predict some sort of effect. What's really established and used by many in the bioinformatics community is for protein-coding sequences; you also have tools for other, non-protein-coding genes. So this classification of effects is really protein-coding centered. Stop
gain: in the coding sequence you have a stop codon, the ribosome stops, and you haven't translated all the protein, so now you have one copy that's truncated. If it's really truncated, it may even be degraded, so you're not even going to see the protein, right? If it's a homozygous stop gain at the beginning of the protein, you may not see the protein at all. Frameshift: similar effect, but instead of telling the ribosome to stop, it changes the reading frame, so now the ribosome is getting the wrong amino acids from one point on. Similar effect; the difference is mostly in the fact that one arises from an SNV and the other arises from an indel. SNVs are easier to detect; indels are more difficult to detect, especially the somatic ones, so be aware of artifacts. When you ran Strelka you got only the somatic SNVs, so in the lab we won't see the frameshifts. Splicing: potentially alters splicing; in fact, it's better to state that it alters key sites guiding splicing, okay? This is the lawyer definition of what ANNOVAR, or the typical annotation tool, is giving you. So you have to do some follow-up; don't necessarily believe that the splicing is altered and that it's a disaster for the protein. Nonetheless, those three get grouped as loss-of-function mutations, meaning that they lead to a pretty massive effect, with the caveat, of course, that if you have a loss of function at the end of the protein, so a stop gain at the end of the protein that removes five amino acids, that's not a big effect, right? Unless you're really removing a signal peptide or something. So keep that in mind. And then you have the insertions and deletions that are in frame, so they remove one or more amino acids or insert one or more amino acids, and stop loss, where you lose a stop codon. We don't really have, at least that I'm aware of, tools to score these. They're not as disruptive as the ones above, but they have some effect; let's keep them in the middle tier. There aren't so many anyway. And then you have the missense SNVs, a lot of them, right? They change
an amino acid. How's that relevant? Well, besides going back to the amino acid table and simply looking at the alignments in UCSC, there are models that are used to score these amino acid changes, which I will go through. And then last but not least, synonymous changes: these don't change the amino acid. However, I have a slide telling you that even these may actually have an effect, but it's really the bottom of the tier. So again, if you're scraping for anything possible, look at the synonymous ones; otherwise you already have plenty of work stopping here. In fact, we will not try to interpret the synonymous ones. Sometimes the ribosome has the ability to correct. Even a stop gain needs to be validated functionally, right? If you can detect the protein at full length, the stop gain is not really functional. So for instance, if you're using Complete Genomics data and you have SNV, reference, SNV, those will be grouped in a block substitution, and then the effect may not be a synonymous change. I don't know if that was what you were saying, or if you were saying just a combination of multiple SNVs at different points, because if you have multiple synonymous SNVs at different points, again, they will not change the protein sequence, but they may have effects going in other directions. What's in the slide I haven't shown you yet is basically that they may change a regulatory sequence that's overlapping the protein-coding sequence but actually controls splicing or something else. So in the end, again, conservation, right? You have a synonymous variant with stellar conservation across so many species: it's not changing the amino acid, but it might be doing something else. You have a few in a cluster, it's not an artifact, they're all conserved, all synonymous: even more evidence. I don't have any case in mind where I saw that, but others may have seen that. But it's more of an exotic thing rather than what you see day to day, the bread and butter, basically. So do you think that we should be reporting
synonymous ones then, like [unclear], something that we don't see happening? Who's going to read the report? Because if you are in the clinic and you're reporting it to the patient, you're going to drive the patient crazy, for instance, right? So it depends on the community. If you're trying to interpret the biology of the cancer, start from the top. If you can squeeze something out of this level, great, but it's going to be a lot of effort compared to the amount of success that you can get out of it, right? I'm just thinking, for me, I always just ignore those ones. So if we were in the oil industry, that's the oil in Saudi Arabia, and these are the tar sands in Alberta: I know how to use them, with environmental effects. I mean, it's not something to completely forget about, but practically we often forget about them when we go through variants. If there are great tools surfacing that really are able to take a thousand synonymous variants and find the one that's relevant, well, I'm not aware of tools that do such a great job right now. They may appear in the future, especially in relation to splicing enhancers; that's something to stay tuned on. If there are tools that actually tell you this synonymous variant is disrupting a splicing enhancer, that may be a great thing to look at. Yeah, but in that case you're not trying to assess the effect of the variant, you're just looking at basically the general mechanics of the mutation mechanism in the cancer. One thing I've seen is translatability, right? Can you look at codon usage? Codon usage, yes. Personally I'm not an expert; I never had a chance to practically use codon usage. But in the slide that I haven't shown, the other point besides a cryptic regulatory sequence is actually codon usage. I don't have any tool to suggest right now to look at that, but in general, just from the biological, theoretical point of view, it's something to keep in
mind. So this is meant as: where do we start from, where do we move to, what's more difficult to look at. I'm not saying anything should go in the garbage right away and never be looked at. Okay, so I've already covered this. The key here really is, on the one hand, you will always be better off if you do a functional validation with a loss-of-function variant, showing that the gene product is not there, or is there in a reduced amount, or is there in a truncated form, okay? So western blots, stuff like that. But what you can already do bioinformatically is look at the percentage of the protein that's affected, and whether there are multiple isoforms, right? You may have a stop gain that's taking away 50 percent of one isoform but not another, and then what are the roles played by the isoforms? Then you have to do more mining; you cannot just know it directly from the databases. Splicing: I've already told you about the complexities; the annotation that you get from ANNOVAR is relatively simplistic. And frameshifts, interestingly: not only can they be false positives, because the calling is more difficult, but they may be rescued by another frameshift. So if you have your variants and you sort them by position and by gene, you can see that if a gene has multiple frameshifts, one may actually correct the other, meaning that you only have a piece of the protein sequence where the frame was incorrect, and then that's rescued down the way. Although if you have genes with a lot of frameshifts, I mean, maybe your tumor sample is really wild, but if there are a lot of frameshifts also in the general reference databases, looking at variants in the general population, this gene may be under very weak constraint and may be less likely to produce a phenotype. That's gene information, but okay. So we're moving towards the last piece, which is how we assess missense variants. The key question is: how does a missense alter protein function after it changes an amino acid? And biologically, before bioinformatically,
there are different ideas of how you may look at that. What's the type of amino acid change? As we said before, we go from polar to hydrophobic, positive to negative charge, proline, which has its own constraint over the bending angle and so forth, to another amino acid or vice versa. How big is the side chain, just looking at the size, and so forth. Another key thing is conservation across species of the genomic sequence which codes for that amino acid. Of course you see that the third nucleotide is usually less conserved, because it doesn't necessarily change the amino acid, but you see the different positions over the coding sequence at different levels of conservation, and I have examples. And then there's also conservation across different protein sequences at the amino acid level, which is actually used by the scoring models that I will talk about later. So here conservation can have different meanings, but the most intuitive one is conservation at the genomic level, where you just compare the nucleotide sequences across so many species after doing a multiple alignment, as we saw before for the AG on the splice site. Whether it overlaps a conserved protein domain is another thing, or whether it overlaps a sequence that creates a secondary protein structure. You may even have the 3D protein structure and be able to do some sort of extrapolation from that, although not for all the proteins. You may have annotated functional features from a database like Swiss-Prot and then take that into account. So these are all things that, biologically, we should be looking at, right? Then we have tools that have a specific model that looks at one or more of those properties and combines them together. Either they use a theoretical model of how things are supposed to look in nature, or they use a machine learning model that's a big, big black box, where you use a training set and the big black box gets internally wired to emulate what it has seen, and then you get a prediction as output. In this course we'll try to shed some light on the
inner workings of these models with one slide each, but you've maybe already had a chance to read the paper that was in the background readings, on MutationAssessor, which also gives you a good overview in the introduction of the strategies used by these tools. Well, this is the example that we saw before of a very evident change in the amino acid associated with a pathogenic allele, and a very moderate change associated with a non-pathogenic allele that's also present in the general population and is not somatic, or not strictly somatic. And, well, we've already discussed this. Okay, so it was actually in this slide: codon usage and cryptic regulatory sequences, such as for splicing. I mean, the easiest heuristic in the end for these is very strong conservation at the genomic DNA nucleotide level, right? If you see outstanding conservation, that can tip you off, and then there are probably more complex models coming out in the future, for instance for splicing enhancers. Before we move into the models for scoring missense variants, well, let's not forget about zygosity, right? Variants come as homozygous ones and heterozygous ones. You can have loss of heterozygosity in cancer, so in fact the heterozygous variant is on the only copy left of the gene. The fact that you can, for instance, kill both copies of the gene on the two homologous chromosomes, rather than only kill one, means you don't have any gene product at all. So zygosity is an orthogonal criterion, but seeing a homozygous stop gain or a heterozygous stop gain can make a lot of difference, right? Or seeing two stop gains at different positions: if you can show they are on different phases, so they're killing the two different copies of the gene, that can make a lot of difference, right? Sometimes in cancer you may see that there's a previous stage that's more benign, which has one stop gain, and then a less benign stage in which that stop gain has become homozygous, or there's another one on the other copy, maybe on p53 or RB1,
stuff like that. So zygosity is a thing not to forget. And don't forget that in males the X comes in one copy, so outside of the so-called pseudo-autosomal regions, if you screw up that copy, you're done. Okay, we are pretty much on time, and in the last five to ten minutes before the lab we're going to look at conservation and missense scoring models. I will show you a specific conservation scoring model, that's called phyloP, and then we'll run through three different models for missense effect scoring. MutationAssessor was in the paper that was in the readings; SIFT and PolyPhen-2 are very, very broadly used. Okay, conservation. Well, we've already had plenty of time to discuss conservation as an idea. In general it's used a lot in bioinformatics; it's a very powerful idea in biology. If all the species need something, it must be doing something: that's the boiled-down heuristic, right? Of course there may be genes that arose specifically in mammals or specifically in humans; many of those actually control, as you may imagine, the brain architecture and neuron dynamics. Fortunately for cancer we don't have so many genes that are unique to humans, so you can solidly trust multi-species conservation. We can look at conservation at the nucleotide level on the genomic sequence; we can look at it comparing protein sequences, even if the proteins don't come from the same genomic locus. In this section I will focus on the one at the genomic level, so aligning different genomic sequences so that their nucleotides overlap. If you look at the UCSC, University of California Santa Cruz, browser and all the back-end data, and you're interested in nucleotide-level conservation, there's the phyloP score, which scores every single position for conservation, looking at substitution rates and using a model to decide if that's close to the neutral substitution rate or if that residue is conserved. And then there's the phastCons score, which, well, you can think of as running through these phyloP scores across the
genome and then checking if you're seeing a high concentration of high phyloP scores, and then it's telling you: not only are the single residues conserved, this interval overall is conserved. And this is not based on some silly averaging over a fixed window; it's based on a hidden Markov model, for people with a computer science background, so it's sophisticated enough. We're not really going to use this, because it is more useful when you're outside of the protein-coding genes. It's useful in telling you whether this sequence, not just a single position but a sequence, is conserved: is there a conserved element, maybe regulatory or doing something, here? Since we focus on protein-coding sequence, we're really going to use phyloP at the single residue to assess missense variants. And then it's also useful to take a look at the multi-species alignment available in UCSC, as we did before when we looked at the intronic AG that controls splicing. This is just a screenshot showing you those tracks. phyloP is available for vertebrates, mammals and primates. For cancer you can look at the vertebrate one, maybe the mammalian one; you don't really need to look at the primates. If you are doing research on schizophrenia or autism, sometimes looking at the primates is interesting. And then you also have the phastCons score. You can see that phastCons doesn't come as a roller coaster as tight as phyloP, but usually comes in big blocks that are either called or not called, meaning that you have elements, or intervals, that are conserved or not conserved, right? So you can imagine the hidden Markov model running through the single phyloP scores and then deciding if it's seeing a stretch that's conserved or not conserved. You can see the relation: where you have less density of high phyloP, like here, with more dips, phastCons drops, and when you have a stretch of phyloP that's high, phastCons rises. You may wonder why you have that repeating pattern where
one of the three phyloP values is low: that's because of codons. The third nucleotide can often be changed without changing the amino acid, and you see that in the phyloP score; it's all over the place. Okay, so I've already briefly sketched what phyloP does. The score comes from a statistical test to detect if nucleotide substitution rates are faster or slower than expected under neutral drift. So you have a model in which something that is not constrained can just change randomly without producing an effect, which is the neutral drift, and then if something is changing faster than that throughout the phylogeny, that's accelerated evolution, and if it's constrained, it's changing less than expected, and that's conservation. I'm giving you a very simple, boiled-down-to-the-minimum explanation, probably not what a quantitative genetics person would give you, but that's the idea to bring home. Here, of course, we're not interested in things that have accelerated evolution; we're interested in conserved things, as a proxy for "that position is really important, it cannot wildly change". And it's available only where aligned sequences are available. That's usually not a big deal for protein-coding exons; it may be a deal when you move out in the genome, where you may have stretches that are unique to humans, or only human and chimp. As you see it on the UCSC browser, conserved is usually greater than 2, or you can be less stringent and take greater than 1.5, or you can be more stringent and take greater than 2.5. Note that in the ANNOVAR database we usually get a score that's slightly different, but I have details on how to back-convert that into this scale. This is the master scale: it actually comes from a transformation of the p-value of the statistical test. Zero is neutral, and negative is diverged, so the site is not under conservation; it has diverged a lot, which might be accelerated evolution. But what's really interesting here is the fact that if you have a change there, it has very little likelihood of being interesting as a
driver, right? Moving towards the end, and I've already given you an overview of this: models scoring missense variants. The key things to keep in mind are what features they use, whether they use a theoretical model or machine learning, and finally what data set is used to assess their performance or, if it's a machine learning model, to train them. Because you can imagine, you have so many features: how do you decide which missense variants are actionable or not? If you're using sets of known missense variants that produce an alteration, there are many ways to define that; if you use variants that you think don't produce an alteration, there are many ways of defining that too. So there's really a lot of biological interpretation of the choices that were made in constructing the tool that you have to keep in mind. The tool that's been around for the longest time, and that has become the most popular, is called SIFT. It was first published in 2001, so more than 10 years ago, which in genomics is almost prehistoric. It's designed specifically with deleterious mutations in mind, so mutations that degrade the capacity of the protein to function. It's not really designed, for instance, for activating mutations, which are also interesting in cancer. That doesn't mean it will do a disastrous job for activating mutations; it just may work less well. And the idea is one, and relatively simple. You take the protein sequence, you find similar protein sequences using PSI-BLAST, and then you come up with a score that uses a framework similar to the BLOSUM matrices, with position-specific substitution matrices, to find the probability of seeing a given amino acid in that panel of related protein sequences. You normalize that, and then you get a score that tells you, if you see a given amino acid that's different from the typical one in the protein, how likely that is to be disruptive. So it's just taking so many protein sequences that are similar, aligning them, and looking at how many times the
alternative amino acid is observed. Maybe zero times; in that case you just have the so-called pseudocounts, which are a minimal background count. And then based on that it's saying this is potentially damaging. So it's just looking in nature at the other protein sequences and how they are constructed. If you always have a given sequence of amino acids in a protein domain, and the protein domain is relatively well conserved, and you have so many proteins, and then all of a sudden, poof, instead of a glycine you get a proline, this should be good at telling you that that's not going to work out, right? But of course you need to have enough protein alignments. And then there's the cutoff that people usually use, which is also the proposed default: if you're below that, something is supposed to be damaging. But really this is using one idea: taking protein sequences, aligning them using PSI-BLAST, and then looking at how many times you observe an alternative amino acid, using that specific probability model. Now, I don't remember if they use all vertebrates or even more, but I mean, PSI-BLAST at some point will not take in sequences that are too diverged. Yeah, and there is a 2009 one, yeah. No, no, I didn't mean to say that they published in 2001 and then stopped doing anything, but you can see that the idea is simple, it came out a long time ago, and it was not revolutionized from the ground up. You can see more recent tools bringing in a lot more stuff, right? Because publications stratify. And that's PolyPhen-2, which basically has many more features; in fact I didn't even list them: sequence-based, structure-based. And this is a typical machine learning model, so it's using all these features, some of which may be more informative and some less informative, and then it's using two different training sets. Now I'll just spend a minute on this. In this one, which is more stringent, you use damaging alleles from known Mendelian disorders, and then you use protein alignments from human to other mammalian homologs
as a negative. This one is less stringent, because you use all human disease-causing mutations, and non-synonymous SNPs without disease association as the negative. So that starts to give you an idea of the kind of choices that you have to make. In the MutationAssessor paper, what they really point out as the big deal with these machine-learning-based models is that if most of your disease-associated variants come from Mendelian disorders, where loss-of-function missense variants are more common, or certain types of missense variants are more common than in cancer, then the performance for cancer may be less good, right? So it's worth using PolyPhen also for cancer, but it may not be the best tool ever. And then the last one, so we keep on time, is MutationAssessor, which again uses an evolutionary idea that's similar to SIFT but extends it: you don't only look at conservation across a set of sequences that are similar, but you take into account whether a subset of those sequences that form a family actually has very high conservation. So it's basically improving a bit over the SIFT model, but again the idea is strongly based on amino acid conservation within protein sequences. And it's not using machine learning; it's just using a theoretical probabilistic model based on entropy. What's interesting is that they benchmarked it against polymorphisms and then somatic mutations present in COSMIC. Now what I find really striking are these different curves. This is the polymorphism one, which is basically telling you that if you're here, with a score of one, you get about 50 percent of the polymorphisms, and if you take four you get basically nothing. You can see the shifts in this curve, meaning that at high values of the score you get a lot of them; this is if you increase the cutoff on how many times a somatic mutation needs to be observed in COSMIC, and this is five or more times. So as you become more stringent on this, being reported for multiple samples, multiple
studies, so it's more likely to be true, true and relevant, just using a frequency-based criterion and nothing else, you can see that MutationAssessor does a better job, right? So that's it for the frontal lectures, and we only have a minute to spare.
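
As a recap of the triage logic described in this lecture, here is a minimal Python sketch. It is not ANNOVAR's code or output format: the category labels, the numeric tiers, and the names `resolve_region`, `is_conserved` and `triage_key` are all hypothetical, chosen only for illustration. What comes from the talk is the overlap precedence (exonic and splicing, then ncRNA, then UTRs, introns, upstream/downstream, intergenic), the ordering of effect types from loss of function down to synonymous, and the phyloP cutoff of 2 (with 1.5 and 2.5 as looser and stricter alternatives).

```python
# Illustrative sketch only: labels, tiers and function names are hypothetical,
# not ANNOVAR's real output strings or API.

# Gene-part precedence as described in the lecture (1 = highest priority).
REGION_RANK = {
    "exonic": 1, "splicing": 1,
    "ncRNA": 2,
    "UTR5": 3, "UTR3": 3,
    "intronic": 4,
    "upstream": 5, "downstream": 5,
    "intergenic": 6,
}

# Effect tiers as described in the lecture (1 = look at these first).
EFFECT_TIER = {
    "stopgain": 1, "frameshift": 1, "splicing": 1,  # loss of function
    "inframe_indel": 2, "stoploss": 2,              # middle tier
    "missense": 3,                                  # needs a scoring model
    "synonymous": 4,                                # bottom of the list
}


def resolve_region(overlaps):
    """Keep only the highest-priority gene-part categories among overlaps."""
    best = min(REGION_RANK[r] for r in overlaps)
    return sorted({r for r in overlaps if REGION_RANK[r] == best})


def is_conserved(phylop, cutoff=2.0):
    """phyloP above the cutoff counts as conserved (1.5 looser, 2.5 stricter)."""
    return phylop > cutoff


def triage_key(variant):
    """Sort key: effect tier first, then conserved sites before the rest."""
    return (EFFECT_TIER[variant["effect"]], not is_conserved(variant["phylop"]))


# Made-up example variants, only to exercise the functions above.
variants = [
    {"id": "v1", "effect": "synonymous", "phylop": 3.1},
    {"id": "v2", "effect": "missense", "phylop": 0.4},
    {"id": "v3", "effect": "stopgain", "phylop": 2.6},
    {"id": "v4", "effect": "missense", "phylop": 2.8},
]
ranked = [v["id"] for v in sorted(variants, key=triage_key)]

print(resolve_region(["intronic", "ncRNA"]))  # intron overlapping an ncRNA
print(resolve_region(["exonic", "ncRNA"]))    # coding exon overlapping an ncRNA
print(ranked)
```

With these made-up inputs, the intron/ncRNA overlap resolves to ncRNA and the exon/ncRNA overlap resolves to exonic, as in the slide, and the stop gain sorts ahead of the conserved missense, the unconserved missense and the synonymous change. A real filter would also need the caveats from the talk that this sketch ignores: zygosity, where in the protein a truncation falls, and which transcript isoforms are affected.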