Thank you very much, and it is a real pleasure to see some very old friends here, but also lots of new people. I notice everything's kind of moved on, which is great. I'm not sure I'm going to live up to that billing of defining all of the questions here — I'm not going to be able to achieve that — but I hope at least to give you some overview, really now from the perspective of a user of this information. So, these are the things I wanted to touch upon: the impact of ENCODE data on understanding human biology — basic biology, common disease and cancer. Quite deliberately I left out rare disease. I think that's something we should discuss: I don't think ENCODE has at the moment made a big impact on rare disease. I think it could do in the future, but it hasn't yet. I have a kind of theme on how the data is used, and then on what we need to think about, and how to frame how we think about the next decade.

Mike Pazin and Mike Cherry deserve a huge amount of credit for this wonderful site on the ENCODE website that is collating other people's use of ENCODE data, characterising it and sub-setting it. I had a very enjoyable four or five hours browsing around it, and I encourage you to go there and look: you can sub-set by all sorts of things — basic biology, all these different categories, publication year, journal. It allows you to really zoom in and understand how people use ENCODE data.

I'm just going to pull out some figures. One of them is from that browsing around; I can't actually remember the paper. It was a PLoS Genetics paper, I think about replication origins. They were working in K562, and so they were able to just dip into the huge amount of data on this cell line and blend their own data with ENCODE data. It's quite hard to imagine this group deciding to generate all of those data sets just for the hell of it, or just to explore — instead they're able to explore all sorts of different things.

The next example is from common disease. This is from my colleague Jeff Barrett at the Sanger Institute, and you could have chosen many, many different figures that tell the same story. Jeff is one of the leading experts on Crohn's disease, and he has spent a lot of effort in getting fine mapping down to the last couple of variants, which statistically is incredibly hard work. What Jeff — or rather Jeff's postdoc — is plotting here is the posterior probability that a variant is the functional variant. On the x-axis, this is increasing certainty that the variant is the functional one: at one end are the loci where there's only one variant that really explains the signal, and at the other the signal is much more diffuse. And then what's being plotted is the enrichment, in one case of H3K4me1 in immune cells, particularly T cells, and in the other of H3K27 acetylation in gut cells. Now what's interesting here is two things. One is that this recapitulates a well-understood part of Crohn's biology, which indeed is about the immune system interrelated with the gut. No surprise there, but very reassuring to see. And for the variants in the first group, you immediately suggest that their function is in T cells; for the variants in the second group, you immediately suggest that their function is in gut cells. So you know the cell line that you should go and do your experiments in next.
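To make the shape of that analysis concrete, here is a minimal sketch of the kind of computation behind such a figure: stratify fine-mapped variants by their posterior probability of being causal, then ask how enriched each stratum is for overlap with a histone-mark peak set. This is not Jeff Barrett's actual pipeline — the file names, column layout and probability bins are hypothetical, and a real analysis would use a carefully matched null rather than the crude all-variant baseline used here.

```python
# Sketch: enrichment of fine-mapped GWAS variants in a histone-mark peak set,
# stratified by posterior probability of being the causal variant.
import pandas as pd

# Hypothetical inputs: variants with columns chrom, pos, ppa (posterior
# probability of being causal); peaks as a three-column BED file.
variants = pd.read_csv("crohns_finemap.tsv", sep="\t")
peaks = pd.read_csv("tcell_H3K4me1_peaks.bed", sep="\t",
                    names=["chrom", "start", "end"])

def in_peak(chrom, pos):
    """True if the variant falls inside any peak on its chromosome."""
    p = peaks[peaks.chrom == chrom]
    return bool(((p.start <= pos) & (pos < p.end)).any())

variants["hit"] = [in_peak(c, p) for c, p in zip(variants.chrom, variants.pos)]

# Crude null: the overlap rate across all tested variants. Compare each
# posterior-probability stratum against it to get a fold enrichment.
baseline = variants.hit.mean()
for lo, hi in [(0.0, 0.1), (0.1, 0.5), (0.5, 1.01)]:
    sel = variants[(variants.ppa >= lo) & (variants.ppa < hi)]
    if len(sel) and baseline > 0:
        print(f"PPA {lo:.1f}-{hi:.1f}: {len(sel)} variants, "
              f"fold enrichment {sel.hit.mean() / baseline:.2f}")
```

Run once with T-cell H3K4me1 peaks and once with gut H3K27ac peaks and you get two curves of the kind described: which cell context lights up at high posterior probability tells you where to do the next experiment.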
And that is a big win for the downstream biology. The final thing to notice is that it's not the case that the immune cells show enrichment of H3K27 acetylation at the same level: the immune-cell signal is very much H3K4 monomethylation, and vice versa, H3K27 acetylation is the gut signal. Now I think that tells you something — that Crohn's is different in how its genetics works in the immune cells versus the gut cells. And that's very interesting again in itself. There's something about the setup of these cells that means the fine mapping in the loci involved in gut is really seeing a different enrichment and a different histone modification.

And I'm sorry you can't see this at the bottom. This is actually work by my postdoc, Sandra Morganella, in cancer — breast cancer. It was quite a delight, actually: I said to my postdoc, oh, you should go off and download the ENCODE data from MCF7 and have a play, combining the somatic variants that we're getting genome-wide in breast cancer with ENCODE data. I had to give him a small crash course in ENCODE, but in fact the website was very manageable for him. I won't bore you too much with this. This is actually using replication timing, so the Repli-seq experiments; later on we also use histone modifications, and these are different types of mutations. The key thing here is that we're seeing very, very different behaviours of different types of mutations relative to replication timing. This was at some level understood previously, but not at this level of detail.

I want to use that example because it's an example I really understand. There's simply no way that we would have gone off and asked our collaborators to generate the amount of data on any breast cancer cell line that we got from MCF7. MCF7 is not a perfect fit to breast cancer, but it's good enough; it gave us signal. It would be better, obviously, if we did these experiments precisely on the breast cancers that we are studying for the mutations, but one has to make some compromises. And I'm particularly happy that we got some technically demanding data sets: there are some annoying histone modifications, which I know are not the ones that are so easy to ChIP, and I was so glad to see them in ENCODE. And for me, one in particular, Repli-seq, would have been an incredible effort to try and set up in our lab, or our collaborators' lab, and make work for us. So it totally lowered the barriers for exploration and discovery — here, of how and why particular cancer mutations arise at different frequencies in breast cancer.
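As a rough illustration of the replication-timing analysis just described — not the actual pipeline — a sketch might bin the genome by Repli-seq signal and compare per-bin mutation counts by mutation type. The input files and their column layouts are hypothetical, and a real analysis would also account for covered sequence and for bins with zero mutations.

```python
# Sketch: relate somatic mutation counts to replication timing.
import pandas as pd

# Hypothetical inputs: fixed-width, non-overlapping genomic bins carrying a
# Repli-seq-derived timing value, and somatic mutations with a type label.
bins = pd.read_csv("mcf7_repliseq_bins.tsv", sep="\t")   # chrom, start, end, timing
muts = pd.read_csv("somatic_mutations.tsv", sep="\t")    # chrom, pos, mut_type

# Assign each mutation to its bin.
binsize = int(bins.end.iloc[0] - bins.start.iloc[0])
muts["bin_start"] = (muts.pos // binsize) * binsize
counts = (muts.groupby(["chrom", "bin_start", "mut_type"]).size()
              .rename("n").reset_index())

merged = counts.merge(bins, left_on=["chrom", "bin_start"],
                      right_on=["chrom", "start"])

# Split timing into quartiles (early -> late) and compare mean per-bin
# mutation counts for each mutation type across quartiles.
merged["rt_quartile"] = pd.qcut(merged.timing, 4,
                                labels=["early", "mid-early", "mid-late", "late"])
print(merged.groupby(["mut_type", "rt_quartile"]).n.mean().unstack())
```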
So I want to step back and ask what the use of this data is. Both Eric and Elise pointed out that ultimately it's very often about hypothesis generation, but as well as general playing around with data, I want to split out three other uses which are very clear cut when I browse through the ENCODE publications.

One is as a design resource. There's a bunch of experiments that happen further downstream — for example, promoter capture Hi-C — where you have to make choices about what you're going to look at, and you need a good way to assess which bits you're going to capture, when, where and how. Very often I see people using particular ENCODE data sets or Epigenome Roadmap data sets for this. So it's not really an ENCODE-specific feature; it's this whole class of functional genomics data, IHEC data included.

The second is as a background resource: 'we tested all the other histone marks, but we found that this particular histone mark was the most associated, or most informative, or what have you'. That can really only be done when you have a catalogue. And the third is as an interpretation resource for variation.

The other thing which I've realised over time is that the cell context of ENCODE data is as important as its location on the genome. Going back to the GWAS association: the most interesting thing for some researchers is not that it's H3K27 acetylation or H3K4 monomethylation; it's that it's T cells or gut cells. That is what really narrows down the set of experiments they want to do next. So that cell context, and keeping track of that cell context, is incredibly important. Now these are technical details, and if Mike were here I would be congratulating him on his data coordination centre.

It's also a good exercise to go back and understand at what level people use this data. Something I should have added here is that one of the key things is that the data is open and easily downloadable — that is a base zero point that I often forget to state. I think the number of people who work from the absolutely raw data, i.e. the reads, is quite limited, but it's incredibly reassuring for everybody who sits just one layer up that they could, if they needed to, go down to that level; that gives you a sort of solidity in the system. Working from the very first processing step — the transformed signal — is very common. That is the thing that, for example, my postdoc picked up and used, and I think many other people do the same: they don't necessarily work off the calls, they work off the first piece of processing. Then there's the calling of elements, which I'd separate into two levels: calling experiment by experiment, and the blended things, for example segmentation. Segmentation has become extremely useful — I think almost a bit too useful, in the sense that people are using it as if it were truth, and it's not really truth; it's a model of something very complicated that's going on underneath, and it will change and evolve over time. But it's an incredibly useful way of collapsing a large, complicated data set down into a manageable way of thinking about chromatin in a particular cell type.
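To make that "levels" point concrete: working from the first processing layer — the signal track — rather than from raw reads might look like the sketch below, using the pyBigWig library. The file name and coordinates are made up.

```python
# Sketch: querying a processed signal track instead of raw reads.
import pyBigWig

bw = pyBigWig.open("ENCODE_H3K27ac_signal.bigWig")  # hypothetical file name

# Mean signal over a candidate regulatory region.
mean_signal = bw.stats("chr1", 1_000_000, 1_005_000, type="mean")[0]

# Binned signal across a wider window, e.g. for plotting around a locus.
profile = bw.stats("chr1", 900_000, 1_100_000, type="mean", nBins=200)

print(f"mean signal over region: {mean_signal}")
bw.close()
```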
Now I want to step back and try to put projects like ENCODE, Epigenome Roadmap, IHEC and all these other projects into context, in terms of what I think needs to happen over the next 10 years in the human genome arena. I've come to feel that there are three steps we go through in building resources — and I'd actually put the human genome itself up as one of these, right at the start as well: build a catalogue, classify, and then curate. I've given some examples in this table, with my feeling about where each is in this progression, and quite deliberately there are things that haven't made it to the bottom right-hand corner. In 10 years' time, 20 years' time, 50 years' time — I'm not quite sure — everything should be in that right-hand corner, where one is tidying up and controlling and looking at the details at some level. That's where we want to be. ENCODE, Roadmap and IHEC are doing the first step, cataloguing — or mainly doing that, in my view — and cataloguing is what they mainly should do.

And I just want to distinguish a catalogue from aggregation, because they look quite similar. Both are about putting a lot of data together that one can download and reuse and do other things with, but there is a difference between building a catalogue and aggregating data sets, and it's this: a catalogue is comprehensive. You've defined a goal for your catalogue and then you want to reach it. And there are these annoying realities — biology is annoying — so you can't finish the human genome through the centromeres, and you're going to have to set a goal that acknowledges that until a better piece of technology comes around. You can't — we'll come down to this — get every cell context as a definable thing, and so you're going to have to handle that in some way. These, I think, are the challenges to get over while staying within the concept of comprehensiveness.

One aspect of comprehensiveness which was a big thing back in 2002, and is just not a big thing at all these days, is that everything is genome-wide. We no longer debate that very seriously, and that's a good thing. We've got a bunch of countable things. I'm keen that every transcription factor gets measured — it deserves it; they're countable things. Every transcription factor, we should know where it binds — and Mike is slightly smiling and sighing at that. Potentially every histone modification. But there are also other axes: quite what level of expression you go down to — there's an interesting debate to be had about which pieces of biochemistry we are interested in in different cells, because one could spend a lot of time measuring an awful lot of things even in just one cell — and also, as I mentioned, these cell contexts. So cell type, but I think we have to move away from a purely static view of cells. My own instinct is that there's going to be more in response: we're going to need to think about cells making decisions and cells responding to signals as much as we do about what you might call the ground state — whatever that means, given they're in some dish being stimulated, being fed, or what have you. So we need to think about cell contexts in a broader way.

Just going back to my three-step scheme — catalogue, classify, curate — I want to put in a plug for classification. This is genomic science proper: trying to understand what is going on. And by necessity this is more anarchic. I think it's fundamentally an anarchic process of people trying to understand why certain events happen, why certain pieces of biochemistry happen in a particular way. And I want to make a particular note here about model organisms. I think there's absolutely no doubt that for understanding human disease we need the catalogue in humans. But for the classification of events, for understanding what is going on, we really don't care where we make that insight: it is going to be valid. Experience has told us that the insights we make in everything from yeast and Arabidopsis upwards are very, very likely to play a major role in our understanding and classification in human as well. That doesn't necessarily mean that one has to build a catalogue in every model organism, but I think it's bonkers not to leverage model organisms in this space. And then I want to say something in particular about variation.
I notice that there are a number of projects — the GGR project, the Funkbar project and others — which acknowledge this intersection between variation and functional behaviour. But I think it's an area where arguably you have to go beyond classifying and towards a model: in other words, a model that says, this variant occurs, and this is what I think is going to happen. And you've got to make that model — you've got to instantiate it, ultimately, mathematically.

And then, because it's so close to what I do, I just want to point out that it doesn't end after you catalogue and you classify: you have to curate. The human genome reference is in a long-term, but still never-ending, curation mode. So is the human protein-coding and ncRNA gene set — this is the U41 grant, with GENCODE. And I think we're getting close to the point where regulatory elements are going to go the same way. There's a handful of known cases — classically the SOX9 and PAX6 mutations and the Sonic Hedgehog mutations, and the TERT promoter in cancer would be another established scenario — where we really understand the relationship between a variant and disease-causing behaviour. But there are a number more that we've got to get to. This perhaps is not for this workshop, but it's worth thinking about, because I think it feeds into all this.

The final thing I'd like to say is that I have been blown away by the improvements in imaging over the last three years. If you haven't already, you've really got to fall in love with super-resolution microscopy and also EM tomography. The ability to look at life at an atomic scale is remarkable, and it opens up completely new techniques. For a genome biologist it's like a completely new axis. And I think the future of the regulatory field, but in particular the chromatin structure field, has got to totally blend with this high-end imaging. These are some pictures from my colleagues at EMBL Heidelberg. This is actually DNA spread out here, and the resolution one can get to is below 50 nanometres on individual molecules. That means that if you can do this at scale on cells, you can really start to see individual chromatin strands and work out where they are relative to each other. I know the 4D Nucleome program is precisely in this space, but we need to make sure that those links are created.

I had a thank-you slide, but it clearly didn't make it across, so I'm really sorry. I'd like to thank Mike for that wonderful website. I showed work from Jeff Barrett at the Sanger Institute. The cancer data was Sandra Morganella with Serena Nik-Zainal from Sanger, and these pictures are from Jonas Ries at EMBL Heidelberg. So thank you very much indeed. I'm happy to take questions; I hope I've stimulated some. Carol?

Thank you. I wanted to ask you to follow up on your example of using the ENCODE data on the MCF7 cell line, the breast cancer cell line. Some of the other NHGRI and Common Fund programs we're going to hear about in a minute are also generating large-scale data for MCF7, right? The LINCS perturbation project, for example. So can you comment a little on the challenges of integrating the data from these different scaled resources to drive biology? Can you pull in LINCS data, for example, into the work your postdoc is doing on MCF7?

I think that challenge is not a challenge that is unique to this field. That is a challenge present across all of biology, where we're being asked to integrate high-dimensional, multimodal data.
High-dimensional meaning we measure things with many, many different dimensions; multimodal meaning we do it with two or three techniques where it's not obvious how one combines those different modes. And I don't think there is a simple way to do it. I think you have to understand the questions you want to ask, and then you need a skilled computational biologist, and they need a toolkit of data cleaning and then statistics. And frankly, the data cleaning is 90% of the drama, which is quite annoying. So that goes to data quality being very important in this game. But I think this problem is not unique to this field; this problem is across all of molecular biology now. Aravinda?

So, Ewan, that was a very good overview. I'm wondering about your thinking on — maybe it's just one question, or first a comment which goes to the same question. You mentioned the precision we can have when you narrow something down to a gut cell, but for those of us who study the gut, there's not one kind of cell. So the question is one of resolution; that's what I want to bring up. In the cases where, in common disease studies, we've been able to at least get an inkling of hypotheses, like you showed with the histone marks — even there, there is statistical enrichment, but the enrichment is two-fold or somewhere thereabouts. So that's about 10% of all the sites we know. The question is: how do we get to the others? Is it just greater resolution in cell types, or is it more perturbations?

It's a good question, and you're absolutely right — just to focus this — it's rare that the perfect experiment, done on the perfect cell line with the perfect mark, has been done beforehand. What you're doing is getting close enough. So you've got gut cell — but is it a deep stem cell in the dividing gut, or a mature cell halfway up, or what have you? That's the experiment that perhaps has to happen somewhere else; or perhaps it's better cataloguing. And I do think of this as a matrix, where at the end of the day you want enough information that you can reliably impute all the places where you don't have information. So maybe you don't do every histone mark in every cell, but you do enough histone marks in enough cells to fit a model, and you can then impute into all the other cells where you have some more limited readout. That's viewing the whole problem with a very statistical mindset, but I don't think that's a bad way of thinking about it.

Just to add to that — I think that's exactly the point. One thing you didn't emphasize, but just did, is the modelling part: how well are we explaining even gene expression, the gene expression that we measure? In some sense, the better you can do that with all of these elements...

Absolutely, and I think doing that better is a good thing. But I would be absolutely for just getting enough data sets that we dominate the problem — so that it stops being sophisticated computation and is instead much more a case of 'well, these are the data sets'. We have the option to generate data sets in a smart way, to make that imputation work well for us.
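A toy version of that matrix view, under the assumption that the cell-type-by-assay matrix is approximately low-rank: measure a subset of entries, fit a low-rank model, and impute the rest. Real imputation methods for epigenomic data (ChromImpute-style approaches, for instance) are far more sophisticated; this is just iterated truncated SVD on simulated data.

```python
# Sketch: impute unmeasured (cell type x assay) entries by low-rank completion.
import numpy as np

rng = np.random.default_rng(0)
truth = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 12))  # 20 cell types x 12 assays
mask = rng.random(truth.shape) < 0.6                         # ~60% of entries measured

X = np.where(mask, truth, 0.0)
for _ in range(100):
    # Fit a rank-3 approximation, then refill only the unmeasured entries.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    approx = (U[:, :3] * s[:3]) @ Vt[:3]
    X = np.where(mask, truth, approx)

err = np.abs((X - truth)[~mask]).mean()
print(f"mean absolute error on held-out entries: {err:.3f}")
```

The design question raised above is then: which entries do you choose to measure so that completing the rest of the matrix is well conditioned?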
Paul?

Ewan, this is about aggregation and cataloguing. One of the uses of the ENCODE data has been that it's systematic, all in one place, and it's lovely and all that. But if we had an aggregation of all the other, smaller-scale project data of the same type, how far would we have got?

One can answer that question, in the sense that aggregation of ChIP-seq happens as well: most people do submit, the metadata does come in, so you can look at that. An incredibly interesting study, I thought, was done by Alvis Brazma a couple of years ago, looking at submitted microarray data sets — and microarray was a technology where there wasn't a cataloguing effort. If you remember, the Novartis Foundation did something like that; I think everybody used the Novartis stuff, but there were lots of proposals to do more of a catalogue, and none of them came through. And what Alvis discovered is that most good experiments happened on blood. Absolutely no surprise at all — most good human experiments happened on blood, and I can totally understand why — and then you get this really annoying long tail. So I think there is something different between aggregation and cataloguing, and what it argues, for me, is that when you're cataloguing you've got to go to the corners which are hard. That's a really important part of the process. The microarray experience is not a one-to-one mapping, but it's an interesting data point, because in some sense it was a case where only aggregation happened.

Sorry, I don't know your name. — Brenda Andrews. — Brenda Andrews was saying that it's important to take model systems and generate large amounts of data from them, even if you don't have a clear-cut line of sight to the disease endpoint. I would totally agree with that, and I want to focus on that middle step of classification, which is incredibly important. Generating data just by itself doesn't inherently give you insights, and my experience, again and again, particularly in model organisms, is that there's some peculiarity of the organism or its setup which makes X an experiment that is very easy to do in that model organism; you go off and do it, and then suddenly you get some insight — very often not even the one you were planning on in the first place. I would totally echo that feeling, and I don't think it detracts from the need to make a catalogue in human at all. Dana?

I just want to make a comment again regarding data quality. You merged ENCODE and TCGA, and we've had a lot of experience with both of them: ENCODE data is significantly higher quality than TCGA data in terms of the amount of artifacts, the reliability, the cleanliness of being able to take it and use it in a significant way. We also often download data when we don't have the right tissue in ENCODE, because I'm a very big believer in context and tissue — and very often, even in cancer, the same cancer, breast cancer, because of the copy number aberrations and all the rearrangements, MCF7 really isn't relevant for another cancer. And I've seen huge variability: yes, among aggregated data, some people who do ChIP-seq-like analyses produce very high quality, and others put their data up there and you see this dramatic drop. So it's all about data quality — garbage in, garbage out. It's not enough to have it measured; it has to be measured right.

Absolutely true.
By the way, one thing I would say, having now lived through some of the cancer stuff: I just think cancer is hard. Cancer variant calling — officially, I've decided it's hard. I don't want to go down there.

Yes, you can say that to Gadi.

I have a newfound respect for anybody who does that.

I would like to follow up on the comment from before, when you said it's likely an immune cell — and you've already had the comment that immune cells are many different cells, even within one type. Even if you're very stringent in the sorting, you'll find an enormous variation across, say, the 200 we're doing at the moment — much more than you would have expected intuitively. So I think an important point is to get away from standardized cells and cell lines, to primary cells which are exposed to environmental cues, to metabolites and so on.

I'm sorry I didn't bring that out. When I said cell context, I did say we've got to go away from the static view — we've got to look at response. I should also have brought in single-cell genomics: part of looking at response is to move away from a population view of response to a single-cell trajectory view of it, and I think that's also incredibly important.

That's one part, but the genetic variation, its influence, and how it affects the epigenome is also a very important one — so not just looking at 100 cells in a population, but looking at 100 different donors at the same cell type.

That too, absolutely. Aviv?

Could you elaborate a little on your thoughts around responses and stimulations? First of all, it just increases the dimensionality of the problem; and by the same token, so does going into genetically variable individuals, be they from a model organism or humans, and into primary cells, and into cell types. Of the many different dimensions in which we could go after variation, which is going to be the most useful? Where's the sweet spot?

I don't know. I agree that this is basically a complicated design area, and I don't think we have enough strong data to tell us. One thing I do note is that I feel the GWAS hits for disease are not having as strong an overlap with the vanilla cis-eQTLs that are now coming out in their thousands. Would you disagree with that?
They agree when you look at cells that are stimulated, and when you look at primary cells that are relevant to the disease.

Yes — and from my perspective, I think the biggest amount of agreement is in that stimulation or response aspect, rather than necessarily the ground state. So if I had to bet, I would be betting on looking for response — effectively, decision-making by the cells. And it's understandable to think that variants which affect the balance of a decision are much more likely to be involved in a disease. But I think the short answer is that we don't have enough data to understand those different dimensions. What we've got to do is make sensible designs now, but pause after a certain period of time and say: right, what is really worth doing, at what scale, at what level, now that we can, for example, do single-cell genomics or epigenomics? To what extent should you use that technology, in which primary cell populations, and to what extent shouldn't you? I just don't think we have enough information to know where that trade-off point is, and that trade-off point will also be very dependent on cost — so technology comes into the mix here as well.

So I have one follow-up question, also to what I think Aravinda and someone else over there asked, and it also relates to what Dana contrasted between ENCODE, which was mostly done on cells, and TCGA, which is done on tissues — and, God forbid, also cancer. I don't think that's the big difference; the question is tissues versus cells.

Yes — well, homogeneous cells.

Yeah. But tissues — I mean the blob, the real physical thing. I don't think there's an easy... do you think there's an easy answer to that?

No, that's why I get to ask.

I think you've got to be data-driven and make a good justification in each scenario. I note with interest that in Drosophila people happily ChIP-seq the entire embryo and get good stuff out of it, and I'm slightly like, whoa, what happened there? But that is about designing your experiments such that you can deal with the interpretation afterwards, knowing it's a mixed cell population. I don't think there's an easy answer; I'm sorry about that. Jay, you look like you want to ask a question. Mike?

Just to comment on that further: this gets wrestled with all the time, and I always look at the metagenomics field, where you can sequence an incredibly complex mixture of bacterial samples and still extract a lot of information, even though we don't always know what all the bacteria in there are. So it gives you some level of information — and having the individual species is obviously incredibly useful too. It's two dimensions, basically: the complex mixture and the individual components. We've wrestled with this a lot in ENCODE: when you start isolating cells — if you purify hepatocytes out of a liver, they're not exactly the same as the hepatocytes in a liver. So you're always struggling with these things; it's the nature of the beast.

Just to touch back on imaging here: I think imaging gives you an orthogonal modality for measuring things in cells — a completely orthogonal modality — and sometimes that works better in vivo, or close to in vivo, than the genomics technologies. So I think there's going to be a lot of interesting interplay: oh, I've taken this, and I've also put it under the microscope, and I've done this clever piece of labelling, and now I can do that clever piece of labelling in a different context where I can't do the genomics, but that will give me
insight into that, and will help bridge those things. So imaging, I think, is another modality. I should also mention, by the way — I didn't bring this out — cohort studies. That's another dimension, one that Hank brought out, and it also goes to these functional studies: the extent to which cohorts think about molecular phenotyping. When they talk about molecular phenotyping, they'll go from RNA-seq into these sorts of chromatin measurements, but probably in blood cells, because that's what you get a lot of in cohort studies, and the interplay there is quite interesting. Just the accessibility of blood means that there'll be a lot of focus on that as a tissue, and that's just the way life is; we're not going to get around that. I feel like I'm done.

I'm just wondering — obviously you're steeped in this — where the technology is at. When we do get tissues, which are a mixture of many different cell types, even cell types we may not normally recognize with the kinds of methods we use, is there some way that the long-range contiguity of epigenetic marks could be used to recover at least some of that?

I might turn this question to Aviv. You know, locally in our shop, Sarah Teichmann, John Marioni and Oliver Stegle have done great work in handling single-cell readouts and deconvoluting them. Let me just tell you about another problem in single cell: of course you've got the cell cycle going on, so you've got to think about the fact that you're measuring things while the cells are somewhere in the cell cycle — so you've got quite a lot going on there. But I think it's quite manageable, actually; the real problem is cost, to be honest. The techniques for this are quickly maturing and the costs are dropping, so we shouldn't leave it out — I should have had single cell up as the other opportunity. And not just with sequencing: there are techniques, for example, that Dana is a great expert in, for looking at protein levels, which I think are going to matter a lot.

Just to weigh in on this difference between cell lines and purified individual cells: we work really only on primary immune cells, which are very highly purified by flow cytometry. But nevertheless, when we do experiments like looking at accessibility or gene expression or whatever on them, we find ourselves going back to the cell line ENCODE data sets for histone modifications and so on, just to identify those things that we can then do again in the primary cells, to give us a complete picture of what's going on — so we don't have to do everything under the sun.

I think that actually goes to this idea that you've got these hotspots where you can generate lots of data, deep, and then you've got other places where you're going to explore another dimension — doing it in a structured way, at least that's how I feel.

This discussion has reached its natural end, before we just debate single cells versus tissues versus cell lines. But before we let you off the hook: at the very beginning you mentioned something about rare disease as an opportunity. I wondered if you wanted to say a few words.

I'll be interested in other people's views in the room, but I think the places where ENCODE data has made a big impact have been in common disease — particularly this association of your loci to different cell types, and then helping you fine-map — and in cancer, where it's becoming just really important for understanding whole-genome data: you certainly just can't do it without having a good grasp of replication
timing. We need to have replication timing on a very large number of cells, otherwise we really won't be able to make good decisions about cancer mutation recurrence rates. But with rare disease I don't see so much systematic effort — and I now hand over to my colleague over there; he can tell me whether I'm bonkers or on the money.

I think you're right: there certainly hasn't been systematic effort in this space yet, and for a few different reasons. We don't yet have large enough cohorts of rare disease patients with whole-genome sequencing data in which we can systematically look for these types of variation. I think with good reason we went after exome sequencing in these cohorts early, because that is where the low-hanging fruit is. But that will change pretty rapidly over the next two to three years: there will be large efforts, NIH-funded and elsewhere, to do whole-genome sequencing on tens of thousands of rare disease patients. The way we'll be able to make sense of that type of data is by having much finer granularity in how we do functional aggregation of variants in non-coding regions, and in how we distinguish benign variation from deleterious variation amid the noise, just as we do in protein-coding regions. So data like ENCODE will be absolutely critical for maybe the 5-15% of rare disease patients where there is some non-coding cause we have to untangle.

And what do you think about modifiers as well? Because I think the other angle to rare disease is the idea that a rare disease is not actually Mendelian but in fact oligogenic. I think there's a big space there as well, of getting at the modifiers of penetrance and that sort of thing.

Yeah. We've been so limited by sample size, and it's only as we start getting thousands of cases that we can start doing modifier searches with any robustness. But it is exciting.

Is there anything specific that could be generated — either a technique or a resource — that would be impactful, or is it just a matter of waiting for the samples?

Can I respond to one aspect? For Mendelian disorders, think of all the sequencing that is done, including clinical sequencing, of well-defined phenotypes for which heterogeneity is not in doubt — CF, for example: there's a small percentage of patients who don't have CFTR coding mutations. Now, classically people have waved their hands, because the sequencing of the coding exons wasn't even done comprehensively; but now there are numbers of people who've done this. So — I understand what Daniel is saying, and that would be great — but there are defined disorders, overwhelmingly single-gene, even if with a few loci, for which there are coding mutations that remain unknown, and I think they would be a great place to begin. I believe there is this myth that all non-coding variants have very small effects, and I think that myth needs to be challenged as well; those cases would be very interesting.

Where I would agree with Daniel is that we've got a train coming at us, which is the UK 100,000 Genomes Project, where a whole bunch of rare diseases are going to be sequenced across the entire genome. Like it or lump it, we're going to need analysis methods that get into this. But these two viewpoints are not incompatible, and of course I totally agree with Aravinda that there are great cases where you can make headway. We have a cohort of Duchenne muscular dystrophy patients where no coding mutation has been found, and in a number of those cases we have been
able to find non-coding causes. So that's a great cohort to go after, but for the vast majority of the rare diseases it'll be harder.

I'm going to mention a reference. There's a beautiful paper, Weedon et al., in Nature Genetics, I think maybe a year and a half ago, which is a perfect example of this. The ENCODE data set wasn't available — they didn't have pancreatic progenitors — so they made their own enhancer map, which is effectively identical to what ENCODE would have done in that cell type, and used that to nail a non-coding distal enhancer mutation. So there are anecdotal examples like that, and it sort of fits the model: if you just improve the resource, it will get used when the time comes.

Right, done? Thank you very much.