So, thank you. I'll give you a warning at the start: I'm not a clinician, and I don't pretend to be one. I'm married to a clinician, and she yells at me when I say things that are completely and utterly ridiculous, and I honestly hope that nothing I say today will deserve that treatment. Is that better? Is it turned on? Okay.

Right, so I'm going to give my very simplified view of what we have here, the dramatically simplified clinical workflow, and then what I'm going to focus on. The first part, identify variants, is of course technically easy and getting easier. The second part is to use what we already know to make sense of them; that's about the best we can do right now, and what we already know includes previous studies and algorithmic approaches. The third part is to do something about it, and that's the part I'm definitely not involved with. It's the center part, using what we already know to make some sense of the variants, where I think database integration and the types of resources we have at the EBI can be useful.

To put a finer point on this: to go from a research understanding of human variation to using it in standard medical practice, I think we need a few things. This is, again, my perspective from sitting on top of a number of very large databases. We need consistent, traceable data generation and analysis routines; we simply have to know what we have done. We need robust annotation based on public information sources, such as those we have at the EBI or the NCBI. I think it's probably true that about 95% of all the information that could be used to understand and interpret human variation is already in the public domain. This is existing information; obviously, as we generate more, this number will go down. I've been saying this for a while, and nobody challenges the percentage.
So I just keep saying it. And then, of course, we need to report some of this information into medical records so that it can be used.

I'm going to talk about two things under this concept of database integration. One is the idea of continually updating the existing information to ensure that it's accurate and comprehensive, and I'll start with some of the ways we do this in the research setting. We have a database at the EBI called the European Genome-phenome Archive (EGA). It's a secure database; it stores data generated from research into molecular medicine. This is what the home page looks like. We organize the data by study, by data set, and by data provider. These turn out to be useful distinctions, as very large data sets are often used in multiple studies, and the people who create data sets often create more than one.

Rather than just looking at the home page, here's some text about it. The EGA provides secure storage and authorized access to all types of data sets generated in the context of research into molecular medicine: DNA sequence, genotypes, transcriptomics, and phenotype data. We currently have data associated with GWAS studies, cancer studies, the Human Epigenome Consortium, UK10K, and many other smaller-scale projects. We are the peer archive, so to speak, of the dbGaP database at the NCBI. For legal reasons and restrictions, we don't exchange the data itself between the EGA and dbGaP, but we do exchange metadata. So if you go to the EGA and search for a study that's available in dbGaP, you'll actually find it, along with the fact that it's available in dbGaP, and vice versa: if you go to dbGaP and search for something like the Wellcome Trust Case Control Consortium, you'll find that we have it and a link to how to get it. So that's the storage of data generated in the context of doing research into molecular medicine.
Beyond that, we are accumulating a tremendous amount of data on the human genome sequence itself, to understand what it's doing. This includes projects that the NHGRI funds, like ENCODE, but the data come from all sorts of different projects. We integrate this together in the Ensembl genome browser; it's how we provide information and annotation on the human genome. We think the types of things we have there are useful for understanding what the genome is doing. These are just some examples of what we have, from alignments, where we identify regions of evolutionary constraint, to genes and gene families and so on.

Within Ensembl we've pulled in a lot of different sources of human variation data. Not only do we bring in the polymorphism data from dbSNP and try to separate out what comes from 1000 Genomes versus other places, we've also been bringing in locus-specific data from locus-specific databases, using standard reference sequences that we've developed, and continue to develop, in collaboration with the NCBI and others. We have structural polymorphism data, mutation data from COSMIC and HGMD, and phenotype data from a number of sources: the types of raw materials people use in their research. This slide just gives some numbers.
As you know, there are around 20,000 genes; we have information on somewhere around a hundred thousand variants, some of which of course overlap. All of this data is there, and we update it continuously with each release. This is still a tiny fraction of the variants known in the genome, a number that is now touching something like 45 million.

I've mentioned this a little already: going from LSDBs, or even diagnostic labs, to central resources of this type is something we've worked on by trying to break down the informatics barriers. Breaking down the informatics barriers does not solve all the problems here. There are many other barriers, including the ease with which one can do this, aspects of intellectual property, and whether or not people are willing to share. But we have worked to bring down the informatics barriers, and, like I said, we have done this in collaboration with the NCBI and with other groups.

So, part two: provide a method to search the relevant resources using variants, or eventually whole genomes, as inputs. We can collect data together, and in fact we've already heard this morning that people are collecting data together to make their own interpretations. Searching through all of this, and searching through the most up-to-date version of it, is something that will continue to be a problem.
Google is a valuable resource, but searching Google with a chromosome coordinate and the letter A doesn't return anything useful, at least not usually. One of the ways we've addressed this is through something we call the Variant Effect Predictor, or VEP. What it does is take in individual sites of variation in the genome and do a number of things. First, it calculates the effect of the SNPs in the context of the Ensembl genes and regulatory features. We provide this with a web interface and with a more standard programmatic interface that people can put into a pipeline. We've back-ported it to the previous version of the human genome assembly. Right now it doesn't handle structural variants very effectively, but we're working to add that functionality, especially in the context of ICGC, and to bring in information from projects like ENCODE to understand when variants disrupt known transcription factor binding sites. We've also created the ability to run it without connecting to the internet.
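To make the idea concrete, here is a deliberately toy sketch, not the real VEP, of the core operation described above: taking a variant position and classifying it against a gene model. The gene coordinates, the 5 kb flanking window, and the consequence labels are all illustrative assumptions; a real annotator refines exonic hits into missense, synonymous, splice-site, and so on.

```python
# Toy consequence classifier (illustrative only, forward strand, one gene).
def classify_variant(pos, gene):
    """Return a rough consequence label for a single-base variant."""
    start, end = gene["start"], gene["end"]
    flank = 5000  # arbitrary flanking window for this sketch
    if pos < start - flank or pos > end + flank:
        return "intergenic_variant"
    if pos < start:
        return "upstream_gene_variant"
    if pos > end:
        return "downstream_gene_variant"
    for exon_start, exon_end in gene["exons"]:
        if exon_start <= pos <= exon_end:
            return "exonic"  # a real tool refines this further
    return "intron_variant"

gene = {"start": 1000, "end": 5000,
        "exons": [(1000, 1200), (3000, 3300), (4800, 5000)]}
print(classify_variant(3100, gene))  # exonic
print(classify_variant(2000, gene))  # intron_variant
print(classify_variant(900, gene))   # upstream_gene_variant
```

The point of the sketch is only the shape of the computation: position in, controlled-vocabulary consequence out, which is what lets the results be piped onward in an analysis pipeline.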
That means it doesn't have to break security. We have now written, and are in the process of testing, code for user-defined plug-ins, so people can do all sorts of very creative things. And finally, we plan to start in 2012 simply returning the answer as to whether a variant has been seen in an EGA data set, behind the protected wall, and then telling people how to apply to get more information about it. This is effectively a variant-based search of the EBI's data resources, and we would like to make it even more comprehensive over time.

I'm going to give you some examples of how this works in practice, just to show you what I mean. At present we bring in all the information we have in the various Ensembl databases, which are themselves a hub of information pulled in from a lot of other EBI resources. We've written a whole host of code for this; it's very modularized, and we provide the results through the web browser. This is an example of the way the web interface works right now and the types of things we return. Because of how it's built, it runs across all species. Human is where it's used the most, but we actually have a lot of people doing farm-animal research who like to use it, and of course there's a lot of profit involved in that. We support multiple data formats. Importantly, and I'll come back to this, we return information using standardized terms, so that everyone, ideally, understands what we're talking about. We provide co-location with existing variants: does it already exist in dbSNP, is it a 1000 Genomes variant, and so on. We can return information in standard nomenclature like HGVS. We have pre-calculated SIFT, PolyPhen, and Condel scores across the entire genome for every possible amino acid substitution, although I think there are questions as to the validity of this.
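On the HGVS point above: a variant can be reported in HGVS-style nomenclature rather than as a bare coordinate. The following minimal sketch shows the formatting for a genomic single-base substitution only; the accession used is just an example, and real HGVS covers many more variant classes (insertions, deletions, coding and protein coordinates) than this toy handles.

```python
# Minimal HGVS-style formatter (genomic single-base substitutions only).
def hgvs_genomic(accession, pos, ref, alt):
    """Format a substitution like ACCESSION:g.POSref>alt."""
    if len(ref) == 1 and len(alt) == 1:
        return f"{accession}:g.{pos}{ref}>{alt}"
    raise ValueError("only single-base substitutions in this sketch")

# Example accession/coordinate chosen for illustration.
print(hgvs_genomic("NC_000017.10", 7579472, "G", "C"))
# NC_000017.10:g.7579472G>C
```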
I can tell you from the usage standpoint that it is incredibly popular; people use it constantly. There is also the ability to filter against HapMap or 1000 Genomes frequency data. This is the way the output comes; it's really a very simple form. It actually goes quite a long way across the page; you can see it's kind of blurred out on the side, but I can show you the rest of it, where we go across and provide a whole host of information. Using the command-line version, this can be customized in a very considerable way.

I mentioned already the importance of talking about things in standardized terms. It turns out that this matters even for very simple things, and one of the points I want to make here is that everyone knows the uncertainty in the parts of the process close to them, but they think the other parts of the process are much more certain than they are. Almost every part of the process is filled with uncertainty. For example, if you report that a variant disrupts a splice site, you might not be talking about the same thing as two other people; splice sites mean different things to different people, even something as simple as that. So we've worked with the Sequence Ontology group, so that when we say a variant disrupts a splice donor, it has a defined meaning, and we've worked with the NCBI, so that when they say a variant disrupts a splice donor, it has the same meaning.
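The Sequence Ontology idea above is essentially a shared controlled vocabulary: every consequence type has an agreed name and a stable accession. A small sketch of what that mapping looks like, with a handful of real SO terms; verify accessions against the ontology itself before relying on them.

```python
# Informal description -> Sequence Ontology (term, accession).
# These four accessions are the SO identifiers for these consequence
# types, listed here for illustration; check the ontology for the rest.
SO_TERMS = {
    "splice donor": ("splice_donor_variant", "SO:0001575"),
    "splice acceptor": ("splice_acceptor_variant", "SO:0001574"),
    "missense": ("missense_variant", "SO:0001583"),
    "stop gained": ("stop_gained", "SO:0001587"),
}

def standardize(informal):
    """Turn an informal label into a defined term with its accession."""
    term, accession = SO_TERMS[informal.lower()]
    return f"{term} ({accession})"

print(standardize("splice donor"))  # splice_donor_variant (SO:0001575)
```

The value is interoperability: two groups emitting `SO:0001575` are, by construction, talking about the same thing, which is exactly the agreement with the NCBI described above.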
These kinds of small things can actually be important to making progress, especially as you try to combine data across larger domains. This is what our SIFT and PolyPhen matrix looks like; like I said, we've calculated it for every possible amino acid change in the human genome, and it's a very popular resource.

Regulatory-region consequences matter especially given the observation that GWAS hits very commonly fall in non-coding regions of the genome. The ENCODE project has been able to annotate many of those associated SNPs with regulatory-region disruptions, or apparent ones. So we've gone through and incorporated data from the ENCODE project, to return whether a variant falls in a regulatory region, whether it falls in an identified transcription factor binding motif, and even whether it falls in a highly informative position within that motif. You can know whether your variant hits, for example, that great big C in the middle, or one of those tiny little ACs over on the side, and that might be important for you in making a decision.

This comes down to our ability to answer a relatively simple question: has this variant ever been seen before? I think it's becoming one of the most common, and maybe one of the most important, questions in human genomics, but it's actually incredibly difficult to answer. Most people who are trying to answer it are collecting all the data they can themselves. If you go back to the 1000 Genomes issue of Nature, back in October 2010, Nature said there were about 2,700 genomes.
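The "great big C versus tiny little ACs" distinction above is the per-position information content of a sequence-logo motif. As a hedged sketch of the standard calculation, assuming a uniform background over the four DNA bases: the information content of a position is 2 + Σ p·log₂(p) bits, so a near-invariant position scores close to 2 bits and a fully mixed one scores 0.

```python
import math

def position_information(counts):
    """Information content (bits) of one motif position, from base counts,
    assuming a uniform ACGT background."""
    total = sum(counts.values())
    ic = 2.0  # log2(4) for the four DNA bases
    for n in counts.values():
        if n:
            p = n / total
            ic += p * math.log2(p)
    return ic

# A near-invariant position (the "great big C") scores close to 2 bits...
print(position_information({"A": 0, "C": 98, "G": 1, "T": 1}))
# ...while a completely mixed position carries no information.
print(position_information({"A": 25, "C": 25, "G": 25, "T": 25}))  # 0.0
```

A variant hitting a high-information position is the case where disruption of the binding site is most plausible, which is the decision-relevant signal described above.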
They said there would be 30,000 by the end of 2011. I have no idea how close we are to that number, but I do know that it's very difficult to get your hands on any large number of genomes if you want to use them as any sensible control. That's also true for exomes, and even the data under controlled access is challenging to get, as any of you who have ever tried will know.

So, just a couple of thoughts on the future. I want to make it very clear that Ensembl is not a clinical decision support tool, and only a fraction of the important resources we can access with it have really been presented today. But I do think that some of the things we're doing show the way forward. The data is comprehensive. It's versioned, so you know what you were looking at, when you looked at it, and when we created it. It's standardized; we use controlled terminology. We update it regularly. We have both evidence-based and algorithmic aspects to it. And we're fully open; all the data can be taken and used.

To say this one more time: there is uncertainty at every step in the process. The genome reference is uncertain. The gene set is uncertain; there is no such thing as a single human gene set. While we all recognize the little areas of uncertainty near us, it's important to recognize that everything is uncertain, and we just have to make the best possible decisions given that. And so, just to acknowledge the people who actually do all of this, as well as the people who pay for me to go to work every day and get it done.

[Audience] The question is, what is the denominator here when you say 95% could be used?
[Audience] Your denominator there must be information that is generated and is either publicly accessible or not publicly accessible, correct?

[Speaker] Essentially, that's true. If one has a variant of unknown significance, for example, and you were to completely track it down, given only the variant, I think about 95% of the information you would use to do that, which might be, let's say, a crystallized protein structure, or whether or not the variant has been observed in other populations, is available in the public domain. I don't think there are large-scale private sources of that information.

[Audience] I would say 95% of the information that could be used to interpret the variants that I'm finding doesn't exist anywhere.

[Speaker] So actually, I'm making a very different point. I'm not talking about the information needed to solve the problem; I'm talking about the information that exists and could be used. Yes, yes, I agree completely: the information required to solve the problem is dramatically missing.