Our next speaker is Peter Donnelly. Peter, if you want to speak from here or the podium, either way is fine. I would remind you that we've actually asked the speakers to speak for 15 minutes; I think our timer is set for the full 25. So to make up for that, I have little signs that I learned to make for my friends at Oxford. So they're there for you, Peter. Peter is the director of the Wellcome Trust Centre for Human Genetics at the University of Oxford, and he'll be speaking on the perils and promise of genome sequence analysis.

Thanks very much. Thanks for the gentle reminder about timing, Terry. So having agreed to do this, which I thought at the time was a bit brave and I now think is very brave, Rick and I got a very helpful email from Terry and Eric a week or so ago, which said: we just want to point out three things. First, thanks for agreeing to give the talk, but you've only got 15 minutes. Secondly, we're very grateful for the experience and perspective you'll bring, and we're sure you'll have a bunch of interesting things you want to say. And thirdly, would you mind making sure you cover the following seven questions in some depth? Rick did a good job of covering the seven questions, but I'm going to adopt a slightly different strategy, which is to pick a couple of those and focus on things that I feel I see a bit more clearly and have more of a sense of. Almost all of the other issues have already been touched on, and many of them will come up at various stages during the meeting.

So I'll talk about three different things: first, issues to do with data access; second, some of the analysis challenges; and third, things to do with sample sizes and how we should think about the scale of the kinds of studies we might be contemplating.

The first one: data access. Francis already helpfully pointed out that there was a meeting a couple of weeks ago on exactly these kinds of issues. We shouldn't revisit all of that, but I do want to make two points, because one of them is directly relevant to the kinds of things we're thinking about here and the other is relevant to our field moving forward. It's widely accepted now, I think, that if there's large public funding, or in some cases in the UK charitable funding, for genomic-scale projects, then that kind of data should be as widely available as possible to bona fide scientists. That means both the sequence or genomic data that's been generated and the phenotype data that's available on those individuals, and I think everyone agrees to that and signs up to it. But the fact that there was a meeting a couple of weeks ago, which I wasn't involved in, is a reflection of the fact that we haven't actually got that sussed yet. We aren't doing it as well as we could be, and I think quite a lot of effort is needed. The relevance for our discussions, I think, as Francis said, is that an absolutely key issue in thinking about large-scale sequencing of cohorts is to make sure that the individuals whose DNA we might be sequencing have in place the right kinds of consents to allow data sharing on the scales that will be necessary. So I think it's an absolutely critical issue in terms of the requirements we should be thinking of for prospective samples.
The other point I want to make, which isn't just in the context of what we're thinking about in this meeting but applies more generally, is something we need to have really firmly in our minds. I don't know what happened at the meeting a couple of weeks ago, and I haven't had a chance to read the recommendations, so maybe this was exactly the kind of thing that was discussed. But I think we want to be working towards a world in which there's a pretty large joined-up database in which, in the future, we, and possibly even clinicians or at least those advising them, will just be able to look things up: here's a particular sequence variant in this gene at this position, it's this amino acid change or whatever; what do we know about other individuals who have that change, and what phenotype information do we have? We need to aim towards that, and it's not the kind of thing we'll get to unless we work pretty hard on it. It's exactly the kind of area where I'd urge NHGRI, and possibly some of the other institutes, to take a really serious lead. We won't get to the position of having this kind of resource in place unless quite a lot of work is done. It's also not very glamorous work, so it's not the kind of thing that will happen by accident.

I want to say a little bit about some of the challenges on the analytical side for the scale of project we're talking about. Some of this touches on things Rick raised; I want to be slightly more cautious, maybe, than he was. In terms of current technologies, methods for calling variants from sequence data are pretty good now for SNPs. They're not a done deal, but they're much better than they were a year ago and they'll keep improving. We're still not very good at calling short insertions and deletions from short-read sequence data, and we're not at all good at calling copy number variants from short-read sequence data, and a substantial amount of work needs to be done to get us to the position where we're better at all of these things. I suspect that over the timescale of the sorts of projects we're talking about, we will have got better at those, although each time a new sequencing technology comes along, we have to reinvent some of these wheels.

Rick talked about annotation of variants as well, and I mean that in two different senses. The first is the kind of naive and obvious thing about whether a variant is non-synonymous or not. We did an experiment recently in Oxford with sequence data we were collecting in a translational project and just applied various of the available annotation packages to it, and the results were a little bit disconcerting. For example, there were 70,000 variants that were called non-synonymous by one annotation package and synonymous by another. So it's the kind of thing we think should already be sorted, and it isn't, though those kinds of issues will improve. The second sense is the challenge of knowing what the consequences of a particular variant are. Loss of function is relatively easy to detect, but for other variants, knowing what their functional consequences are is also an area where I think there's a lot of scope for extra work.
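To make the annotation-discordance point concrete, here is a minimal sketch of the kind of cross-package comparison described above. It assumes two VCF files for the same variants, each produced by a different annotation package and each carrying a hypothetical single-value consequence tag (named `CSQ` here for illustration); the file names, tag name, and crude consequence classification are all assumptions, not any particular tool's real output format.

```python
"""Minimal sketch: compare consequence calls from two annotation packages.

Assumes two VCFs covering the same variants, each with a simple INFO tag
(hypothetically named CSQ) whose value contains a consequence term such
as 'synonymous_variant' or 'missense_variant'.
"""

def load_consequences(vcf_path, tag="CSQ"):
    """Map (chrom, pos, ref, alt) -> consequence string from one VCF."""
    calls = {}
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos, _id, ref, alt, _qual, _filt, info = \
                line.rstrip("\n").split("\t")[:8]
            for field in info.split(";"):
                if field.startswith(tag + "="):
                    calls[(chrom, int(pos), ref, alt)] = field.split("=", 1)[1]
    return calls

def is_nonsynonymous(csq):
    """Crude classification: treat missense/stop terms as non-synonymous."""
    return any(term in csq for term in ("missense", "stop_gained", "stop_lost"))

# Hypothetical file names for the two packages' outputs.
a = load_consequences("annotator_a.vcf")
b = load_consequences("annotator_b.vcf")

# Count variants annotated by both but classified differently.
discordant = [key for key in a.keys() & b.keys()
              if is_nonsynonymous(a[key]) != is_nonsynonymous(b[key])]
print(f"{len(discordant)} variants classified differently by the two packages")
```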
If we're thinking about sequencing cohorts, as we should be, with rich phenotype data available, we need to work through the problems and challenges, and I think there are discussions of this in some of the sessions tomorrow, of just linking very large amounts of sequence data with possibly very, very large amounts of phenotype data, imaging, and so forth. Those are non-trivial IT challenges, and we'll talk a bit about them tomorrow.

And finally, in terms of one of the things we hope to do, which is to understand the relationship between DNA sequence variants and phenotypes, particularly disease phenotypes, we're not very good at the moment, I think, in terms of the maturity and efficiency of the analysis methods for doing that. We were rather spoiled in the days of genome-wide association studies. The obvious thing to do in a genome-wide association study was just to look at each SNP one by one, doing the naive thing of testing each single SNP for a difference in frequency between cases and controls. That's what people did in the genome-wide association studies, and it turns out you get a long way with that strategy. In the case of sequence data, that doesn't work. With perhaps a small number of exceptions, there won't be individual sequence variants that we don't yet know about where we can just look at that variant and look for differences, say, between cases and controls or in quantitative phenotypes. So to see signals, we need to amalgamate information within units, where the unit might be a gene or, more ambitiously, a pathway. If you've looked at the field, there are a bunch of methods out there which aim to do that: the hope is to somehow combine variants which are maybe rare, or maybe have a certain predicted function, and so on. But again, we need to work harder on that. This whole area is one in which, if we look at the methods we're using in three or four or five years' time, I'd hope they've moved a long way from where we are now. It's the kind of thing I always say in this kind of context, but if we're thinking of large-scale sequencing projects, more and more, and particularly for sequence data, we need to think about putting aside substantial resources for their analysis. There's huge potential to combine rich genetic variation data, sequence data, ideally whole genome, with rich phenotype data, but there are major challenges in getting as much information as we can out of it, and it's silly just to invest in the generation of the data without giving ourselves the chance to harvest the rewards.

For the last point I want to make, when I was thinking about it, I was reminded of the story many of you will know from one of the Sherlock Holmes stories, when the policeman, who happens to be called Inspector Gregory, I learnt when I looked it up on Wikipedia, says to Sherlock Holmes, is there any point to which you would wish to draw my attention? And Holmes replies, to the curious incident of the dog in the night-time. Gregory says, the dog did nothing in the night-time, and Holmes, in his characteristically smug way, says, that was the curious incident. So what's the connection here? It's something that's already been alluded to, and it's the following: we've already done a lot of sequencing. Francis mentioned 65,000 samples that NIH has been involved in funding.
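As a concrete instance of the "amalgamate information within a unit" idea above, here is a minimal sketch of a simple collapsing (burden-style) test: all rare variants in a gene are collapsed to a single carrier indicator per person, and carrier rates are compared between cases and controls. The genotype matrices are toy assumptions; real methods in this family (CMC, SKAT, and so on) are considerably more sophisticated.

```python
"""Minimal sketch of a collapsing (burden-style) rare-variant test.

Collapses all rare variants in a gene to one indicator per person
(carries at least one rare allele or not), then compares carrier
rates between cases and controls with Fisher's exact test.
"""
import numpy as np
from scipy.stats import fisher_exact

def collapsing_test(geno_cases, geno_controls):
    """geno_*: (n_samples, n_rare_variants) arrays of alt-allele counts (0/1/2)."""
    case_carriers = int((geno_cases.sum(axis=1) > 0).sum())
    ctrl_carriers = int((geno_controls.sum(axis=1) > 0).sum())
    table = [[case_carriers, geno_cases.shape[0] - case_carriers],
             [ctrl_carriers, geno_controls.shape[0] - ctrl_carriers]]
    odds_ratio, p_value = fisher_exact(table)
    return odds_ratio, p_value

# Toy example: 1,000 cases and 1,000 controls, 20 rare variants in one gene,
# with a slight enrichment of rare alleles among cases.
rng = np.random.default_rng(0)
cases = rng.binomial(2, 0.005, size=(1000, 20))
controls = rng.binomial(2, 0.003, size=(1000, 20))
print(collapsing_test(cases, controls))
```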
Here are some of the big projects: the ESP project, several type 2 diabetes projects, projects I know of in autism and schizophrenia; those are just some of them. Actually, there's quite a lot of data from cancer tumor-normal pairs, but I don't know how much effort is being put into just looking at the normals for germline susceptibility. So we've done a lot of sequencing already, and many of those projects are relatively mature; they've been going for two and a half or three years. None of them has yet finished, but what have we learnt so far? Well, one thing we've learnt: there aren't many examples we know about yet where those studies have led to startling new discoveries. As I said, none of them is yet finished, but there's some information in that. I'm involved centrally in one of them and peripherally in one or two others, and I've spoken to people involved with the other projects; most of them would say that the QQ plots, genome-wide measures of how much signal there is, look pretty flat. There's information in that, and I'll come back to what it is in a minute.

The other thing is that there's been a huge amount of effort put into what's called imputation: using data we have from a relatively small number of sequenced individuals, where we know the patterns of linkage disequilibrium, to predict what those variants would look like in very large panels with genome-wide association data. Imputation is something that works reasonably well. If we think about variants of frequency one, two, three percent, we know that we don't impute all of those well, but we impute many of them well, and that work has been done in very large cohorts of tens of thousands of cases and controls for some diseases. Again, it's been relatively unfruitful to date.

So where have we got to? We don't know the full picture, of course we don't, and in some sense, as Eric Boerwinkle said, we don't have to debate the pros and cons of this, but things could have been different. There was a hope that there would be so-called Goldilocks mutations: mutations at low frequency, not very rare, but low frequency, with large or moderate effect sizes, things like PCSK9. I think there's growing evidence that there at least aren't as many of those as some people might have hoped, particularly from the imputation data, where we've imputed these frequency ranges into very large sample sets. Not all of that imputation works well, but some of it works quite well, so if these variants were relatively common, we'd have seen more of them than we have. I think it's becoming clear, not surprising in some ways, but worth stating: we need to look at very large sample sizes. Most of the people involved in the sequencing projects to date, if you talk to them, would say we need to look at more samples, and the current ones are already not trivial, so I think we need to bear that in mind: large sample sizes and the ability to follow up substantially anything that looks interesting. And at the moment, we don't yet know how important rare variants will be in terms of common disease phenotypes. We all see the attraction, as Francis said, of finding examples of humans who are homozygous for loss-of-function mutations in particular genes; that's obviously very attractive to drug companies trying to use it to learn about the consequences of particular drugs. Some of those drug companies have looked for those in genes they're interested in, with not great success to date.
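For orientation on the "flat QQ plot" remark above: a QQ plot just compares the observed association p-values with what would be expected under the null hypothesis of no signal, and a flat plot hugs the diagonal. A minimal sketch, using simulated null p-values rather than any real study's results:

```python
"""Minimal sketch of a QQ plot for association p-values.

Under the null, p-values are uniform on (0, 1), so sorted observed
-log10(p) should track the expected quantiles; a 'flat' QQ plot of
the kind described in the talk hugs the diagonal. P-values here are
simulated, standing in for real per-test results.
"""
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
pvals = rng.uniform(size=100_000)  # stand-in for real association tests

# Sort descending in p (ascending in -log10 p) and compute expected quantiles.
obs = -np.log10(np.sort(pvals))[::-1]
exp = -np.log10((np.arange(1, len(pvals) + 1) - 0.5) / len(pvals))[::-1]

plt.plot(exp, obs, ".", markersize=2)
plt.plot([0, exp.max()], [0, exp.max()])  # null diagonal
plt.xlabel("expected -log10(p)")
plt.ylabel("observed -log10(p)")
plt.title("QQ plot (simulated null data)")
plt.show()
```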
As Francis said, different populations will be important in different ways for these kinds of questions. Again, a kind of sobering thought; it's not my calculation, it was done by someone I trust: to find PCSK9 in an exome study, if we're insisting on something like genome-wide significance, would require about 30,000 samples. So I think we really have to think, if we're doing this at all, about doing it with reasonable sample sizes. I'll finish with a slide of power calculations that I borrowed from Mark McCarthy, so they're somewhat type 2 diabetes focused. This shows three different settings where you do, for reasons that were natural for their purposes, 3,800, 9,000, or 12,800 exomes, and then follow up with targeted sequencing of those genes in 10,000 cases and 10,000 controls. So not a small experiment. It uses a particular test, taking a certain proportion of variants through to the follow-up sequencing. These represent different genes with different sorts of effects, to give you a sense of calibration. NOD2, a gene we now know quite a bit about because of its role in Crohn's disease, and which has been known for a while, would sit about here in terms of its effect. So we're probably looking at this range of the spectrum, and again, the message I want to get across is that if we get into these kinds of studies, based on what we know so far, they really had better be large. Thanks very much.

Great, thank you. Thank you for staying so close to time; we very much appreciate it. Comments for Peter? Questions?

Peter, early on in your talk, you alluded to the creation of a massive database. Can you put a little more detail on what that might look like, what its scope might be?

Well, here's the kind of dream. We, as researchers, and I suspect at some time in the future possibly those involved in clinical care, would like to be able, when we have a patient who has a particular mutation in a particular gene, to look up and say: okay, which other people who have been sequenced have that mutation, and what do we know about their phenotypes? That's the kind of thing we'd like to aim for. There are issues of taking existing data sets and putting them in a form where you can make those kinds of queries, but essentially it means bringing together as much of the sequencing as possible for which there's phenotypic information.

And the implication from that explanation is that currently existing databases are not properly structured, not properly scaled. It sounds like there's a deficiency that you would like to see addressed.

Yes. I'm guessing this is the kind of stuff that was discussed at the other meeting, or at least I hope it was, and I hope there are recommendations in place that aim to deal with it. We're a long way from that. Even in the cases of projects where you can get at the data, it's not as straightforward as it could be, even in settings where one's trying hard. So to do that kind of thing, of asking in what instances someone has this mutation, or something I think might be like it, in this gene, and what their phenotype is, you currently need to look in a lot of different places, and each one of those look-ups isn't straightforward.
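Returning for a moment to the power slide: the kind of sample-size arithmetic behind numbers like the 30,000 cited for PCSK9 can be sketched with a standard two-proportion normal approximation for an allelic case-control test. This is not the calculation Peter cites; the allele frequency and odds ratio below are illustrative assumptions.

```python
"""Minimal sketch of a sample-size calculation for a single-variant
case-control allele test, using a standard two-proportion normal
approximation. The parameters below are illustrative assumptions,
not the PCSK9 numbers cited in the talk.
"""
from scipy.stats import norm

def n_per_group(p_control, odds_ratio, alpha=5e-8, power=0.80):
    """Individuals per group needed for an allelic two-proportion test."""
    # Convert control allele frequency and odds ratio to case frequency.
    odds = odds_ratio * p_control / (1.0 - p_control)
    p_case = odds / (1.0 + odds)
    z_a = norm.ppf(1.0 - alpha / 2.0)   # two-sided genome-wide threshold
    z_b = norm.ppf(power)
    var = p_case * (1 - p_case) + p_control * (1 - p_control)
    n_alleles = (z_a + z_b) ** 2 * var / (p_case - p_control) ** 2
    return n_alleles / 2.0              # two alleles per diploid individual

# Example: 1% control allele frequency, odds ratio 2, genome-wide alpha.
print(round(n_per_group(0.01, 2.0)))
```

Plugging in lower frequencies or smaller odds ratios makes the required n climb steeply, which is the qualitative point of the slide: studies in this range of the spectrum had better be large.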
So, Maynard and then Thomas.

With respect to rich phenotypic data, you can make the phenotypic data rich by going for breadth, looking at the whole EMR for a patient, or everything you can gather from the EMR or EHR, or you can go deep in some particular area, which tends to be the focus of most current studies: very careful diagnostic work with respect to some fairly particular phenotype. Choices are gonna have to be made between broader phenotypes or deeper phenotypes. Do you have a view about which way we should go?

I don't think I have a well-informed view, if I should express a view at all. I'll just say, as a practical matter, having argued that we're gonna need large cohorts: it tends to be the case, just for cost reasons, that a lot of in-depth phenotyping is more likely to happen in smaller cohorts than in very large ones. So that will be a practical issue. I think there are open questions, and I don't have a strong sense of what the right approach is about whether to go into a lot of depth or not. One of the lessons that's come out of what we've learned so far from genome-wide association studies: there were debates five years ago which said, actually, we shouldn't really be focusing on disease as an endpoint, we should be focusing on this biomarker or that biomarker. So, for example, instead of doing studies of type 2 diabetes, we should be looking at fasting glucose levels as biomarkers. And what we now know, not that it's my area, but what I understand we now know, is that there are some variants which affect type 2 diabetes as an outcome that don't affect fasting glucose levels, there are some that affect type 2 diabetes and affect fasting glucose levels in the direction you'd expect, and there are some which affect fasting glucose levels but don't seem to affect the disease outcome. So I don't think it's obvious that we're better off focusing on intermediate phenotypes. The lesson, at least from the genetic architecture in terms of common variants, is that it doesn't always pan out the way you'd expect it to.

On this point: surely this depends a bit on what your study design is. If you're, for instance, interested in a particular phenotype, I mentioned earlier the idea of something that's protective against Alzheimer's, or, say, individuals who are morbidly obese but have normal glucose tolerance, you could imagine that you would have done pretty careful phenotyping around that issue and then do your exome sequencing and see what you find. But I think a lot of the time it's gonna be the other way around, where it's the genotype that drives your interest. And I would bet that in very few instances, if any, would you be satisfied with the phenotype information you had on those individuals who turn up with particularly remarkable genotypes, because it's gonna point you in a direction, based on what's known about that gene or that pathway, where you're gonna wanna go deeper. And I guess the corollary of this, and maybe, Peter, this relates to your comments about consent: it's probably then not just consent for broad use of the data as it was collected, but also consent for re-contact and the opportunity to do that deeper phenotyping; otherwise you're left forever in the dark. Fair?

I completely agree.
I think the value of samples that you can go back to for additional phenotyping will be huge. I completely agree with you.

Great. Thomas.

This is just an observation about the large existing databases of sequence data and phenotypic data: many of the existing studies have limitations in terms of consent for what kinds of studies the data can be used for. I think everybody is in favor of having these large connected databases, but with existing samples it's not always possible; the existing samples, you know, are frequently not consented for such broad use. So in a sense, we're talking about prospective studies where we have to either re-consent or develop new studies and do new sequencing. Also, I'm wondering, maybe people who are in the ELSI field can comment on this: how easy would it be to actually propose using existing databases for clinical use? These were research studies, and they were not always generated for clinical use.

Just a quick comment on the consent issue: you're absolutely right, and there may well be issues in terms of having to re-consent and so on. But there are some large collections, and Rory Collins can speak to this, UK Biobank being one of them, which do already have very broad consents in place for large numbers of samples. But it's clearly an issue in terms of thinking about choices of cohorts and some of the practical challenges.

Well, we ought to move on; I was going to take the last question, but I could never do that to my friend Chris. So Chris, go ahead.

Just one, and that was a great presentation. You talked about how we don't yet know what the analysis strategy is. Many of the GWAS studies have focused on meta-analysis: analyzing individual cohorts, or large cohorts like the WTCCC, and then combining those in a kind of meta-analysis. Do you think that's an approach that is possible with whole genome sequence analysis, or do all of the data from the multiple cohorts or multiple samples need to be in one place to conduct these kinds of analyses?

No, I'm sure meta-analysis is as good an idea with sequence data as it has been with GWAS data. We now know that the genotyping assays we use in genome-wide association studies are very robust; you can run them almost anywhere and get the same kinds of answers, and that makes it much easier to combine data. I think with short-read sequence data there are more vagaries and artifacts, and also the genome is huge, so you'd need to be a bit careful, I think, combining data sequenced with different technologies at different places, but it's not impossible.

So Peter, I assume that these figures are based on unrelated cases and controls?

Yes, yes.

So how, just approximately, do you think this might be affected if one were using multiplex families rather than unrelated cases and controls?

Good question. I can't do power calculations in my head, and I haven't done this one before, so I don't know the answer.

Great, thanks.
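On the meta-analysis question, the standard machinery does carry over in principle: each cohort is analyzed separately, and per-variant effect estimates are then combined, for example with fixed-effects inverse-variance weighting. A minimal sketch, with made-up summary statistics standing in for real cohort results:

```python
"""Minimal sketch of a fixed-effects inverse-variance meta-analysis,
combining per-cohort effect estimates (betas) and standard errors for
a single variant. The summary statistics below are made up.
"""
import numpy as np
from scipy.stats import norm

def fixed_effects_meta(betas, ses):
    """Combine per-cohort estimates with inverse-variance weights."""
    betas = np.asarray(betas, dtype=float)
    ses = np.asarray(ses, dtype=float)
    weights = 1.0 / ses**2
    beta = (weights * betas).sum() / weights.sum()  # pooled effect
    se = np.sqrt(1.0 / weights.sum())               # pooled standard error
    z = beta / se
    p = 2.0 * norm.sf(abs(z))                       # two-sided p-value
    return beta, se, p

# Three hypothetical cohorts reporting the same variant.
print(fixed_effects_meta(betas=[0.10, 0.15, 0.08], ses=[0.05, 0.07, 0.06]))
```

Only summary statistics cross sites here, which is why the approach worked so well for GWAS consortia; the caution in the answer above is that with short-read sequence data, per-site calling artifacts can differ in ways the summary statistics alone may not reveal.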