All right. How about now? Any better? All right. So I'll try to project and speak loudly so that you can hear what I'm saying. Please, if you can't hear me, wave or say something. So I'm Jamie Teer. I'm a postdoctoral fellow working with both Les Biesecker and Jim Mullikin. And I work on technology development aspects, including informatics, intermediate analyses, as well as more end-user analyses, sort of trying to determine which variants could be biologically interesting. And so today I'll talk to you about what we're doing with variant annotation and introduce the topic of viewing some of the exome sequencing data. And like Jim, I'll talk about many different software tools available and tell you about some of the ones we're using, but that doesn't imply any type of endorsement. So you heard from Jim about the sequence generation, as well as the alignment and the calling of genotypes. This will often be done by the sequence provider for you. Certainly alignment and calling of genotypes can be complex, and that can be further refined, often by informatics teams. And today what I'll talk about are the next steps: first, annotation of the variants, and then I'll introduce analysis of these variants and some tools you can use to look at your data. These steps tend to require more informatics experience, but our hope is that these steps can be done by the end-users, so you can really just look at your own data. All right, so I'll frame these general considerations as sort of the who, what, where questions. We'll start with where are the reads aligned. I'll briefly talk about some of the programs to view the alignments, and there is value to looking at the more raw data; I'll tell you a little bit about that. What is the effect of our variants?
We'll talk about annotation and consequence programs to determine the context and then what predicted detriment your variant might have. Who else has the variant? We'll talk about some variant databases, looking at populations and sequencing data across many different populations. And finally, how can this all be done? I'll talk a little bit about pipeline software to help pull all these tools together and generate a final set of data. And then, a brief shift: we'll talk about how to identify important variants, and I'll introduce a tool we've written to allow end-users access to the data. Now, the tools that I'll talk about require varying degrees of computational and informatics expertise, so there will be a whole range of tools. Some of these are easier to use and require less informatics experience; this generally means a graphical interface, with buttons and lists and things. Those tools will be represented by the picture of the laptop here. On the other end of the spectrum are tools that are more challenging to use. They require much more experience, perhaps knowledge of UNIX and Linux, and these tools are almost always command line driven, so you're typing things out on the command line. Those tools will be represented by a little picture of a server. Okay, so let's start with where are the reads. Are the reads aligned correctly? What does the alignment look like? Jim showed you a picture in his ELAND slide where the alignment looked really messy, and that would give you a hint that perhaps something might be going wrong in that region. So there's value in looking at your raw data. The basic low-level approach is to look at the exact file formats. One of the file formats we use, which is becoming quite common, is the SAM/BAM format. This file will contain 20 to 100 million alignments, or reads, per sample, so there's quite a lot of data.
And there are a variety of programs available to view and manipulate these data. They're mostly command line driven, and many of them are actually libraries for programming languages, so you really do need to write your own programs to use them. I know you can't see the data, and it's not important exactly what's going on, but here we do have reads: we have sequences, we have the quality scores encoded in a special way, and then alignments. And clearly, looking at this is perhaps not terribly informative. You're really just getting lists of reads, and it's not so useful. So on to some of the tools that are more useful; I apologize if this doesn't show up so well. This is an example of a program that's part of SAMtools, called tview. This is something we actually use quite a bit, mainly because it's very, very fast. It's a text-based viewer, and so you have dots and commas illustrating forward and reverse alignment reads that are the same as the reference. And here you see that when things are different from the reference, you actually see the base. Similar to the slide Jim showed, when you have both non-reference and reference bases, you can say that that is a heterozygous variant. We have one over here as well. There are some colors here that give you a sense of the quality, just as a quick look. And another thing you can get from looking at this: in this particular case, you'll notice that when there is a variant over here, there is not a variant on the same read over here, so these variants are actually on different chromosomes. So this program is very fast. It is text-based, and that adds to its speed. However, the text-based nature also really limits its functionality: it has very basic functionality, and so in our case it's really a go-to tool to spot-check variation in regions. Okay, so a different tool, a little more user-friendly, is the UCSC Genome Browser. I'm sure many of you are familiar with this browser.
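To give a flavor of what "the exact file format" means here, this is a minimal sketch of pulling apart one SAM alignment record by hand in Python. The field layout follows the SAM specification; the read itself is made up for illustration.

```python
# Minimal sketch: parsing one SAM alignment record by hand.
# Field layout follows the SAM specification; the read shown here is made up.
sam_line = "read1\t0\tchr14\t105246551\t60\t5M\t*\t0\t0\tACGTA\tIIIII"

fields = sam_line.split("\t")
record = {
    "qname": fields[0],            # read name
    "flag": int(fields[1]),        # bitwise flags (0 = mapped, forward strand)
    "rname": fields[2],            # reference (chromosome) name
    "pos": int(fields[3]),         # 1-based leftmost mapping position
    "mapq": int(fields[4]),        # mapping quality
    "cigar": fields[5],            # alignment description
    "seq": fields[9],              # read sequence
    # Base qualities are ASCII-encoded: Phred score + 33.
    "quals": [ord(c) - 33 for c in fields[10]],
}

print(record["rname"], record["pos"], record["quals"][0])  # chr14 105246551 40
```

Multiply this by tens of millions of reads per sample, and you can see why lists of raw records aren't terribly informative on their own.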
It does now accept BAM files, and shown here are the alignments. You can see lots of different reads; again, the two positions I showed you previously. The view is still a little bit dense, though. Really, the UCSC browser allows you to view your data together with the UCSC tracks, and that's quite powerful. You do need a public-facing server to hold your data: what happens is the UCSC browser looks at your data to fetch what it needs, and so that can take a little bit of expertise to set up. The viewing options are a little more limited; you can't generally change too much of the view, but it is still a powerful approach. The final tool for visualization I'll talk about is called IGV, the Integrative Genomics Viewer from the Broad. And I should point out that if you've downloaded the presentation, the last three slides have links to all these tools. And if you can't see those slides, you can just search for the name of the tool, add "genomics" to it, and you can generally find the tool. Okay, so IGV: a similar sort of view to UCSC. You have, again, your cytoband and your chromosome location. Down here you have gene models, so you can see where your variants and reads align compared to genes. Again, here are the two positions, the substitutions we were looking at before, the heterozygous positions. The nice thing about IGV is you can zoom in very closely to your data and zoom out. You can highlight the reads and see quality. Here they have a nice sort of histogram of the depth that you're looking at. So this is a very powerful tool. It does, as I said, allow zooming; you can highlight reads to get more info; it has many features; and really, it was designed to be quite integrative, so it takes many different kinds of genomics data. I encourage you to take a look at this one. And it does have a web launcher in addition to being able to run locally, so it seems to be quite easy to use.
Okay, so now we'll get more into the meat of annotation. You've been given a list of many, many variants. What do you do with it? What information is there? Are they in a gene? Are they coding? Could it be a detrimental change? These are the questions we seek to answer. When you get your list of variants, you're really getting just that: a chromosome, a position, and the change. And by itself, there's not a whole lot of information there, especially for potential biological function. The annotation really provides context. So now, all right, we have this position. Well, we can see by annotation that it's in a gene, in this particular case the AKT1 gene. It's coding; it's in an exon here. The change would cause an amino acid substitution. That's informative. Down here we can see that it's highly conserved. That's quite interesting. It's a known SNP. If you were to dig into the literature, you would be able to identify that this position is associated with various diseases. And in that way, you've gained a lot of information about your variant. And so for the first step, basic annotation, the goal is to determine variant context. All of these tools will do the basics of identifying whether your variant falls within a gene. They'll tell you if it's coding, what amino acid position it's in, and then what the actual amino acid change would be if it is a non-synonymous variant. And each tool then offers a little bit extra. ANNOVAR here offers exonic splicing, common variant formats, intergenic descriptions, and things like that. CDPred offers a conserved domain prediction, which I'll talk about in just a second. SeattleSeq Annotation offers a really broad variety of annotation tools, so many different data sets you can compare your variants to, to see what could be going on. Really, I think this one has the most features.
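The core of that "what amino acid change would it be" step can be sketched in a few lines. This is a toy illustration, not any of the tools just named: the tiny codon table covers only this example, whereas real annotators use the full genetic code plus transcript models.

```python
# Toy sketch of the core of variant-context annotation: given a codon and a
# substitution, is the change synonymous or non-synonymous? The tiny codon
# table below covers just this example; real annotators use the full table
# plus transcript models.
CODON_TABLE = {"ATG": "M", "ATA": "I", "ACT": "T", "ACC": "T"}

def classify(codon, offset, new_base):
    """Return (old_aa, new_aa, 'synonymous' | 'non-synonymous')."""
    new_codon = codon[:offset] + new_base + codon[offset + 1:]
    old_aa, new_aa = CODON_TABLE[codon], CODON_TABLE[new_codon]
    kind = "synonymous" if old_aa == new_aa else "non-synonymous"
    return old_aa, new_aa, kind

# A methionine-to-isoleucine change (non-synonymous)...
print(classify("ATG", 2, "A"))   # ('M', 'I', 'non-synonymous')
# ...versus a threonine-to-threonine change (synonymous).
print(classify("ACT", 2, "C"))   # ('T', 'T', 'synonymous')
```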
And then this one down here, SnpEff, integrates with GATK, an analysis pipeline written by the Broad, as well as with Galaxy, which I'll talk about a little more later. And it can read and write a more common format, the variant call format, to actually describe your variants and their annotations. Some of these programs run locally: ANNOVAR and SnpEff run locally on local data and are therefore very, very fast. CDPred uses local scripts, so the data is local, but it accesses an external server for the annotation information. And this one here is completely external, so you have to upload all of your data to their server. That's something to be aware of: this particular tool's terms of service say that it is possible that someday the data may be released. So if you have private data, you need to be aware of where you're actually putting it and what that might entail. So now we've annotated, and we have some idea of what the variant is. Now we'd perhaps like to know about consequence. How detrimental is a variant? I'll point out that most of these tools are really amino acid centric, but more and more information about the genome is becoming available, and I definitely hope to see tools that go beyond amino acids and are able to predict consequence in other regions. We'll start with SIFT. SIFT uses degrees of conservation among proteins to predict a detrimental effect. PolyPhen-2 uses a variety of features including sequence, conservation, and structure; if there's a known structural model, this tool actually takes that into account to determine if a variant could be causing a biological effect. CDPred uses the Conserved Domains Database, and so if a variant changes away from the ancestral position within a conserved domain, that would be predicted to be detrimental. Down here, this is sort of a different type of tool, not really a prediction tool.
The Human Gene Mutation Database is really a curation of the literature and of locus-specific databases that's constantly updated. I believe you'll hear a lot more about this later, and so I really just wanted to mention that although it is subscription based, NIH now has a license for it, and if you go to the website (again, on the slides), it gives you the instructions on how to set up an account and be able to access those data. And so again, these tools can be very powerful and can help guide an analysis, but they're not perfectly predictive, so you should always carefully consider the information that you receive from them and not just blindly use it to determine what your most important variants could be. So at NISC, this is just an example of the types of annotations we do and what the data actually look like. Here we were using ANNOVAR, CDPred, and HGMD, and this is just a screenshot from our program, VarSifter, that I'll talk about a little later. Every row is a different variant, and over here we have variant types: some of these are intronic, we have synonymous and non-synonymous single nucleotide variants, and down here intergenic regions. Gene names are here, so what gene does the variant fall in? And here is consequence: in this case we get the gene name, a transcript ID, the exon number, what position is being changed in the cDNA, and then the protein. So here we have a synonymous change, threonine to threonine, but here a non-synonymous methionine to isoleucine. Over here is some of the prediction, the CDPred score. For this particular predictive tool, when the score is lower, the variant is predicted to be more detrimental; so of course the synonymous ones aren't really predicted to cause any difference, but some of these numbers are getting down quite low.
And then if you look at HGMD, now we have diseases associated with these variants. The example is the CFTR gene, which of course is associated with cystic fibrosis, and some of these detrimental variants are identified as being associated with the disease. And so, looking at all this information together, you can begin to see how you might start to sort and filter the data and identify positions that might be biologically relevant, particularly to a disease or a phenotype of interest. Okay, so now we've talked about where the variant might be and what consequence it might have biologically. Now I'd like to talk about who else has the variant. Is the variant common in a population? For example, if you were studying a rare dominant disorder and your variant is common in another population, that variant's probably not causing the disorder, or else all those people would have it as well. So is it seen in other populations, or in certain populations, and has it been observed in a disease cohort? There are many studies now that are looking at certain disease cohorts, and this information can all be helpful when deciding if your variant is important. And so here are some of the human variation databases that are available; this list will definitely grow, I believe, in the next few years. We'll start with dbSNP. It should be pointed out that dbSNP really includes everything. It is not just a database of common variation; it can include variants associated with disease, really everything. The SNPs do have information about origin, so you can identify the project in which they were identified, and when the projects are larger, even get frequency information: how common was it in a particular population? You may have heard of the Thousand Genomes Project. This is a project designed to generate a lot of sequence for a large number of populations across the globe.
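The rare-dominant-disorder argument above amounts to a frequency cutoff, which can be sketched as follows. The variant names, frequencies, and the 1% threshold are all illustrative, not from any real dataset.

```python
# Sketch of the population-frequency argument: for a rare dominant disorder,
# drop variants whose frequency in a reference population exceeds a cutoff.
# Variant names, frequencies, and the 1% cutoff are illustrative.
MAX_FREQ = 0.01

population_freq = {"varA": 0.15, "varB": 0.0004, "varC": None}  # None = unseen

# Keep variants that are rare (or never observed) in the reference population.
candidates = [v for v, f in population_freq.items() if f is None or f <= MAX_FREQ]
print(candidates)  # ['varB', 'varC']
```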
The current dataset has more than 1,000 low-coverage genomes, and low-coverage whole genomes tend to limit sensitivity, so the most rare variants might not be detected. They say, though, that their sensitivity is good for variants at a frequency of 1% or higher, so it's pretty good, but the rarest of the rare variants won't be in that dataset. To address that, however, they have sequenced a large number of exomes, and those data should be coming soon and will be high coverage, and therefore will have the power to detect the rarest variants. The ClinSeq Project you've heard a little bit about. Currently, we have data for 650 exomes, with a plan to increase that to 1,500. These individuals have extensive phenotypes, and the phenotypes will be deposited in dbGaP, and soon we'll be releasing data from these individuals to dbSNP as well. Another project that's slightly newer is the NHLBI Exome Sequencing Project. Currently, they have 2,500 exomes with phenotypes, again to be deposited into dbGaP; this is in dbSNP, and a VCF will be available. So I've given this a big server and a little laptop icon. Most of these datasets come as flat text files or some other format and require informatics tools to access the data. However, some of them do have web interfaces, and this will be true for some of the other tools I talk about. The web interfaces are, of course, easy to use, but generally only allow you to check one variant at a time, and the facilities to check hundreds or thousands of variants aren't as easy to use. So these can certainly be used in a web-based way, but that might not be as useful as really getting into the command line to examine thousands or hundreds of thousands of variants. And just to highlight a personal story in our group for dbSNP: in the search for a cause of Proteus syndrome, we had identified a variant in the AKT1 gene. This was a non-synonymous variant; again, this was the one I showed before, so highly conserved, but there was a known SNP.
So we looked at the SNP carefully and found that that SNP is actually a deletion, where our change was a substitution. Okay, we're fine; we don't have to worry about that. However, in the latest build of dbSNP, there actually is a new SNP that is the exact substitution we observed. And so had we blindly filtered out all dbSNP positions, we would have missed the variant that we have now been able to associate with Proteus syndrome. So certainly dbSNP is very powerful. It's a great resource, but you just have to be careful when using it, because it is not exclusively common polymorphisms. So I've told you about a large number of tools, and now the question is how can these things be run? Is there a graphical interface? How can we really tie it all together to go from the calls down to annotated data? I'll briefly discuss what we're doing at NISC. I'll start by saying that this is a bioinformatics-experience-heavy approach. We use Sun Grid Engine to work with our cluster. We write scripts in Perl and use a variety of Linux and Unix tools for this to run smoothly. And even then it does require a large degree of experience and hand-holding to watch the data and make sure things run correctly. So we start with a sample genotypes file; this is basically the genotypes, as Jim described, for one sample. The first thing we do is identify the variants in each of our samples and make a list of variants. And then we take the positions in this list of variants and go back and determine the genotypes from all the different files at those positions. And really the point is to say, at this position in this sample, we have a non-reference position, or we don't have non-reference because we have a reference position, or we can't say because we don't have enough data. And being able to say that a position is not a variant because we know that it is reference is very powerful.
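The lesson of that Proteus syndrome story can be made concrete: before excluding a variant because "a SNP exists at that position," check that the database record describes the same kind of change. The records and positions below are made up for illustration, not real dbSNP entries.

```python
# Sketch of the dbSNP caveat: a position-only lookup would treat a recorded
# deletion as if it covered our substitution; matching on variant type avoids
# blindly filtering out a potentially causal change. Records are made up.
dbsnp = {("chr14", 105246551): {"id": "rs_example", "type": "deletion"}}

def known_in_dbsnp(chrom, pos, var_type):
    """True only if the database holds the same *kind* of change here."""
    rec = dbsnp.get((chrom, pos))
    return rec is not None and rec["type"] == var_type

# Our observed change is a substitution; the recorded SNP is a deletion,
# so a position-only filter would have wrongly discarded this variant.
print(known_in_dbsnp("chr14", 105246551, "substitution"))  # False
print(known_in_dbsnp("chr14", 105246551, "deletion"))      # True
```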
We can then determine frequencies and actually say with certainty that this variant is rare in our sample set, in our population. And so we call this process back genotyping, and it takes a little bit of computational time to do, but we find it to be quite valuable. So now we have a file of back genotypes: all the variant positions, and then all the genotypes for each sample, reference or non-reference. And now we undertake the annotation process. Again, we take these data and calculate annotation based on the three programs I talked about, then merge it back with the file. We calculate frequencies in our populations, and get them from other databases, and merge that back with the file. And at the end we end up with an output genotype file that's currently in structured text, just a tab-delimited text file. But in the future we also hope to offer the more common file format that's being accepted more and more, the variant call format file. Okay, so that's great if you have a lot of experience. A lot of the big sequencing centers do sort of roll their own pipeline; they create their own. But is there another way? So a team has developed a very cool piece of software called Galaxy. This is web-based software. It can be run either locally on your own servers or through a publicly accessible web page at Penn State. And the basic idea of Galaxy is to allow investigators to use a wide variety of tools for DNA sequence analysis in a graphical way. So the idea is that you load your data over here. You then can choose from many different annotation, or really analysis, programs, including in this case amino acid changes. Here is SIFT, which I talked about, along with a really wide variety of others. You then pick one of the tools you want to use, select the dataset that you loaded, select some other options, and then click Execute.
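The back-genotyping idea can be sketched roughly as below: at each variant position, classify every sample as variant, reference, or no-call, so that frequencies are computed only over samples with enough data. The names, genotype encoding, and depth threshold are my own illustration, not NISC's actual pipeline.

```python
# Rough sketch of "back genotyping": at a variant position, classify each
# sample as variant, reference, or no-call, then compute frequency over the
# confidently called samples only. Threshold and encoding are illustrative.
MIN_DEPTH = 8  # assumed coverage cutoff for a confident call

def back_genotype(samples):
    """samples: list of (genotype, depth); genotype is 'ref' or 'var'."""
    counts = {"var": 0, "ref": 0, "no_call": 0}
    for gt, depth in samples:
        counts[gt if depth >= MIN_DEPTH else "no_call"] += 1
    called = counts["var"] + counts["ref"]
    freq = counts["var"] / called if called else None
    return counts, freq

counts, freq = back_genotype([("var", 30), ("ref", 25), ("ref", 3), ("ref", 40)])
print(counts, freq)  # 1 variant, 2 reference, 1 no-call; frequency 1/3
```

Knowing a sample is reference (rather than merely uncalled) is what makes the denominator, and hence the frequency, trustworthy.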
And this will submit the job to run in the background on the Galaxy servers and eventually return data here. And it's nice: this will give you a history of the tools you've applied to your data. And you can even tie together different tools to make what they call a workflow, but it really is a pipeline that can be run over and over again on different files. Galaxy is also quite nice in that they've spent some time integrating it with the Amazon cloud. As Jim told you, it really takes a lot of computational hardware to do these types of analyses. The fact that sequencing costs are beating Moore's law is actually very, very frightening to folks like me. However, the integration with the Amazon cloud means that you don't necessarily have to buy all this hardware and maintain it. You can set up an account with Amazon and basically rent their servers. Galaxy is then essentially installed for you, and you can access it as if it were your own cluster. That's a very powerful approach. And finally, I should say that the team behind Galaxy has really made an effort to make this easy to use. There's a variety of tutorials and how-tos and even videos describing how to use the tool, to make it as user-friendly as possible. Just to show you some of the other features: it does have next generation sequencing tools, including a variety of tools to help you deal with SAM and BAM alignment files, and additionally a lot of tools for text manipulation. I've sort of hinted that a lot of the data that comes out of bioinformatics analysis is textual in nature, and so in many cases you'd want to change columns around or reformat the data output from one program in order to be able to input it into another program, and Galaxy offers you some ways to do that as well.
Okay, so I've described now how variants are annotated, so at this point hopefully you will have a list of variants and the genotypes, as well as a large number of annotations with which to determine biological importance. So now, how do you identify the important variants? There are several approaches. One approach is really user independent: it doesn't take preconceived ideas or notions, and uses the data itself to allow important variants to arise. The other approach is just the opposite: it says you, as the user, have generated a career's worth of information, so use that information in determining what variants could be biologically relevant. And of course, are the tools easy to use? So again, we'll start very low level with file formats. The variant call format, used by the Thousand Genomes Project, is a text format. There are ways to zip it to make it a much smaller format and very fast to use, and there are some tools to view it, but again, these tools are generally libraries for programming languages. And you can see (the details aren't important) that there's a lot of information here; it's quite dense, and looking at this by eye isn't really terribly useful. A structured text file is something that we often work with. This would contain a header line telling you what's in the file, and then columns separated in some standard way, including annotations and samples and different types of information. The advantage of this type of file is that it can be loaded into spreadsheet programs like Excel. People have certainly done that successfully, but we have found that there are some features that would be wanting in Excel, and so there are other approaches.
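A structured text file of the kind just described is straightforward to read programmatically, which is part of its appeal. Here is a minimal sketch using Python's standard library; the column names and values are illustrative, not our actual output format.

```python
# Minimal sketch of reading a structured (tab-delimited) variant file:
# the first line is a header naming the columns, and each following line
# is one variant. Columns and values here are illustrative.
import csv
import io

text = (
    "Chr\tPos\tGene\tType\n"
    "chr7\t117199644\tCFTR\tnon-synonymous\n"
    "chr14\t105246551\tAKT1\tnon-synonymous\n"
)

# csv.DictReader keys each row's values by the header names.
rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
print(rows[0]["Gene"], rows[1]["Pos"])  # CFTR 105246551
```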
And at this point, as Jim said, for each sample you're looking at 70 to 150 thousand variants, and each sample you add adds more variants. So again, we are talking about a large amount of data; you certainly can't start biological experiments on each variant, and so how can you reduce this number to something that's much more manageable? The first kinds of tools are variant prioritization tools that are run ab initio, so they don't require any user input. One tool is called VAAST. It prioritizes variants using a probabilistic approach: they're basically using case-control, GWAS-style statistics, but are also including more information. They use amino acid substitution, this idea of how detrimental an amino acid change might be. They use aggregation, and this is a neat idea where you look at a region of the genome (in the case of VAAST, it's a gene), combine all the variants in that gene together, and consider it as one entity; for rare variants this helps you account for any kind of rare variant in a gene, to offer more power. And this tool can also use inheritance information. The tool is free for academic research use, but do check the license to make sure it applies. VarMD is a tool you'll be hearing about later, and it prioritizes variants using inheritance patterns, so if you're studying a certain type of inheritance, this tool will allow you to identify those variants that fit your models. Again, I won't say too much about this, but it is available on Helix in the Galaxy development section, so it's available for NIH folks. And so finally today I'll talk to you about a tool that we've developed to really help users with limited informatics experience get their hands dirty with this vast amount of data and really try to identify what could be biologically relevant. The tool is called VarSifter, and it allows for the viewing, sorting, and filtering of your variants. It's available inside the NIH at this website here.
And so basically what we're looking at is a view of annotations. Each row in this view is a different variant. You have your chromosome and position location information, and then all the annotations I've previously described: gene name, mutation type, database ID, and things like that. Each of these columns you can click to sort your data, so you can sort by gene name, really by anything that's in this table. Clicking on any row will show you all the genotypes for your samples in this window. Even if you have hundreds of samples, they're all displayed, and these are all sortable, so you can sort by genotype and see who has the actual variant; and then coverage and genotype score quality information are visible as well. And over on this side you have all the filters with which you can reduce the number of variants under consideration. So I'll just walk you through several examples using this tool. First, I'd like to impress upon you that in this file we have 76,000 variants. This is really just a test file: there are two samples, but it's really only one sample that I've copied and given a single artificial change. So presumably the amount of data you have will be much higher than this. All right, and yet 76,000 isn't really a small number, so how can you begin to get down to those variants that are most interesting? Let's start by filtering on mutation type. This is a list in the upper corner that was generated from all the different types in your data file. For fun, we'll just click on the potentially most detrimental: we would click on the splice-site and stop boxes here. And down here is the Apply Filter button. This is an important button; it really applies all the filters you've selected. So multiple filters can be chosen in any combination, and then applying them will filter the data. In this case we've applied the filter, and we have only stops and splice sites.
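The mutation-type filter just applied is, at heart, a simple membership test over the variant records. A sketch with toy data:

```python
# Sketch of the mutation-type filter: keep only the potentially most
# detrimental classes (stop and splice-site). Variant records are toy data.
variants = [
    {"gene": "A", "type": "stop"},
    {"gene": "B", "type": "synonymous"},
    {"gene": "C", "type": "splice-site"},
    {"gene": "D", "type": "intronic"},
]

kept = [v for v in variants if v["type"] in ("stop", "splice-site")]
print([v["gene"] for v in kept])  # ['A', 'C']
```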
Again, you can be sorting any of these columns to really look into what is going on here. And we've reduced the number down to 133. That wasn't so bad. But 133 is still a lot, so how can you reduce it further? If we look here, we notice that there are a lot of variants that have been previously described in a population database. Given the caveats I've mentioned before, let's just see what happens if we eliminate those things that have been recorded in this database already. So here we have an exclude box: we'll check "exclude database ID," apply the filter, and now our list is down to 25 variants. So now we're really getting into numbers that are handleable; you can do some literature searching and even think about biological experiments. So again, sorting and filtering. Let's pretend now that none of these were interesting, so this filter failed. What can we do next? Now we just clear, back out to our original list, and start again. And this is really the idea of the tool: to allow you to design filters to dig deeper and deeper into smaller and smaller sets of data and examine the data, and then, if nothing is apparently interesting, to back out a little bit, or back out a lot, and really just work with the data, diving in and backing out, to see what's going on. So now let's say you've got some information about your samples. You have affected-normal pairs; this would be the type of thing you would look at if you had a tumor-normal pair, where the samples are the same except for some very important differences that you're interested in. So here we have an affected-normal pair. We'll click the box that says the affected are different from the normals in at least this many pairs. If you have multiple pairs, you can determine how many you're interested in, to dial in your sensitivity. So now we'll click the Apply Filter button, and here is the single artificial change I introduced in this practice data set.
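The affected-normal pair filter can be sketched as follows: keep positions where the affected genotype differs from its paired normal in at least some number of pairs. The data layout and genotype labels are my own simplification of how such a filter behaves, not VarSifter's internals.

```python
# Sketch of the affected-normal pair filter: keep positions where the
# affected genotype differs from its paired normal in >= min_pairs pairs.
# Data layout and genotype labels are an illustrative simplification.
def pair_filter(positions, min_pairs=1):
    kept = []
    for pos in positions:
        diffs = sum(1 for aff, norm in pos["pairs"] if aff != norm)
        if diffs >= min_pairs:
            kept.append(pos["pos"])
    return kept

positions = [
    {"pos": 101, "pairs": [("het", "ref")]},   # affected differs from normal
    {"pos": 202, "pairs": [("het", "het")]},   # identical within the pair
]
print(pair_filter(positions))  # [101]
```

Raising `min_pairs` when you have many pairs is exactly the "dial in your sensitivity" knob described above.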
So that was great, but does somebody have to define this for you? No, you can define it yourself: if you click on File and then Sample Settings, you'll see a window like this that lists your samples and then the status for each sample, with boxes you can check to set that status. Affected-normal samples have to come in pairs, but cases and controls are more flexible: you can identify any number of cases or any number of controls, but having at least one of each is required. All right, so now we'll uncheck affected-normal, check case and control, and hit OK. And now, down here, a new filter is available, the case-control filter, and it basically says: I want to see positions that are variant in X or more cases but Y or fewer controls, and these numbers can be adjusted to dial in sensitivity as you would like. In this case it will identify the exact same artificial variant I introduced. What's next? Filtering by gene name. I'm sure most of you have favorite genes that you'd be interested in seeing the variants in. So down here, in the "search gene names for" box, you just type in the gene name; here in our example it's CFTR, and these are all the variants that lie in CFTR. Now of course, if there were a gene called CFTRQ, that would show up as well, because it does match CFTR. And so in this box you can use a syntax called regular expressions, which is a very powerful text-searching method. If you don't know what that is, I urge you to check it out this evening, because it's really cool and it really gives you power in searching, and it's supported here as well. So what other types of things can you do? Say you don't have just one interesting gene: you have a whole list; you'd like to know about genes in a pathway or in a gene family. You can create a text file, one gene name per line, load that into the program, and then identify those variants that fall within those genes.
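The CFTR-versus-CFTRQ point is exactly where regular expressions earn their keep: a plain substring search matches both, while an anchored pattern matches only the gene you asked for. CFTRQ is, as above, a hypothetical gene name used for illustration.

```python
# Sketch of the gene-name search: a substring match on "CFTR" also hits a
# hypothetical gene like "CFTRQ", while an anchored regular expression
# matches exactly. Gene list is illustrative.
import re

genes = ["CFTR", "CFTRQ", "AKT1"]

substring_hits = [g for g in genes if "CFTR" in g]
regex_hits = [g for g in genes if re.fullmatch(r"CFTR", g)]
print(substring_hits)  # ['CFTR', 'CFTRQ']
print(regex_hits)      # ['CFTR']
```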
Similarly, if you have a list of genes that you're really not interested in (you know they give false positives and you just don't want to look at them), you can load the same kind of gene file and exclude those genes from consideration. Let's say now that you have regions of interest: you want to look in regions near GWAS peaks, or you've got linkage regions, or perhaps a chromosomal region that could be interesting. These regions can be loaded as a BED file, which is really a chromosome, start, end format, and then you can filter on that, and the program will show you the variants that fall within those regions of interest. Those are the options that come standard with the program, but say there's something else you want to do. Perhaps you have a VCF file; unfortunately, the VCF standard does not specify a column for the gene mutation type, so the program won't know about it, but using the custom query filter you can still filter on columns like that, which the program is not otherwise aware of. And so what we're really looking at here is the custom query part of VarSifter: a central window that shows all the queries you've designed and the logic of how they link together, with sample options over here and annotation options over here. Just to walk you through how this would work: we click View Custom Query, and now we have a blank slate. We'll start with an annotation query; let's go along with this example of a type of annotation that the program does not know about. We'll click on "type" up here, and you'll notice that this list becomes populated with information and these buttons become active. So we've chosen a type; now we have to choose an action, and we'll say "exactly matches," so now we have to choose something that it exactly matches. This list contains all the values that were observed in that column of your file, so you can select any one of these values, or type in some search text here, again using the regular expression syntax.
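The region filter described above amounts to testing each variant against intervals read from a BED file. A minimal sketch, assuming BED's usual zero-based, half-open coordinates and an illustrative variant layout (the linkage region shown is just example data):

```python
def parse_bed(lines):
    """Read (chrom, start, end) intervals from BED-format lines.
    BED starts are zero-based; ends are exclusive."""
    regions = []
    for line in lines:
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions

def in_regions(variants, regions):
    """Keep variants falling within any region of interest."""
    return [v for v in variants
            if any(v["chrom"] == c and s <= v["pos"] < e
                   for c, s, e in regions)]

bed = parse_bed(["chr7\t117100000\t117360000"])   # e.g. a linkage region
variants = [{"chrom": "chr7", "pos": 117199644},
            {"chrom": "chr2", "pos": 500}]
kept = in_regions(variants, bed)
```

With more than a handful of regions you would want an interval tree or a sorted-and-merged interval list rather than this linear scan, but the membership test is the same.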
So in this case we'll click "stop," and now we have a query block here: type equals stop. That's one query. Okay, let's add a sample query. We had clicked "affected," and now we have action buttons available; this time we'll choose "does not match," and now we can click either a genotype (homozygous reference, homozygous variant) or a different sample. In this case we had clicked "affected does not match," and now we'll click "normal," so we have a new block: affected does not equal normal. This sample filter is powerful, and you can use it to define your own sort of Mendelian inheritance filtering. A simple example: parent one equals heterozygous, parent two equals heterozygous, affected offspring equals homozygous non-reference; that filter set would allow you to identify homozygous recessive variants. Now we have to link these two blocks together. If we draw a box around them or shift-click them, they are highlighted, and down here we have logical statements, AND and OR being very common: a AND b, or a OR b. We'll click AND, and now we have a statement that says affected does not equal normal AND type equals stop. You then have to finalize the query, and it behaves just like any other query: you check the box, and you can combine it with any of the other filters I showed you on the main page. In this case, were we to apply this query, we would get an empty list back, because the single change I introduced was not a stop variant, so you would back out again and start a different search. And with that I'll close. I hope I've given you some context: we start with annotation to give more information about the variants that have been identified, and consequence predictions can guide the analysis, though of course they are really just a guide, not absolute truth. The tools I've shown do require varying levels of experience. There is an effort to make them easier to use, but certainly some personal experience, or a collaboration with those who do have more experience, can be a valuable thing when running these very powerful tools. And so once you have annotation information, you can either use prioritization tools, which are something of a black box but return information free of preconceptions, so this is user-independent information, or you can do user-guided analysis, using your own knowledge to determine which variants could be the most interesting. And finally, I showed you a tool that we've written to make this hands-on analysis approach easy to use, and hopefully that will be helpful for you. With that I'd like to close and say thank you very much.