Thank you very much, Teri. So I have a fairly broad mandate here, which I think is actually much broader than my specific domain of expertise. A lot of the work that I tend to do is on downstream analytics of large-scale sequencing data, but it's been good. I'm glad, in a sense, that I was assigned this task, because it has given me a chance to interact with a whole lot of other people who spend a lot of time thinking about upstream processing of sequence data and other types of complex data sets. I just want to acknowledge upfront that a lot of what I'll be talking about today derives from conversations with the people on this slide, and particularly to acknowledge the very fruitful discussions over the last few years with the 1000 Genomes Analysis Group, which has really shaped my view of how we deal with large heterogeneous sequencing data sets.

So this is the plausible near-future scenario that I considered as I started to think about the challenges of aggregating sequence data from an informatic perspective. Let's imagine that we have exome sequence data (it could easily be whole genome sequence data instead) together with complex phenotype data, available for 100,000 individuals derived from multiple different sources. Currently that would comprise approximately a petabyte of raw data, which is a vast amount; I'll give you some numbers in a second. At the moment, at least, the vast majority of that data, in fact almost all of it, would be the raw sequence data. So phenotype data, while it takes much longer and is much more difficult to collect, actually constitutes a very tiny fraction of the overall data held for these individuals. One of the points that I'll try to make later, though, is that as we move into longitudinal sampling of things like RNA sequencing data over time, or other deep longitudinal measurements of patients, the proportion of these data that is phenotype rather than sequence data will increase.

So the goals in this context, I think, would be firstly to create accurate variant calls that are consistent across all of the samples within this set, regardless of which source they came from; secondly to harmonise and clean the phenotype data, which of course is very difficult; and finally, and perhaps most importantly, to make sure that the data are not just accessible to the community but actually usable, so that many different people, not just statistical geneticists but biologists and people from pharmaceutical companies, can actually use these data to address their specific biological questions.

The four key challenges that I see arising in this area are, firstly, logistics, which is basically just moving, storing and processing very large data sets; secondly, and very importantly, harmonisation, that is, how we actually pull together data from different sources and make them consistent across an entire cohort; thirdly, analytical challenges, which I think have been addressed very well by other speakers, so I won't spend much time on those; and finally, and crucially, issues associated with access and usability of the data. Okay, so the first set of logistical challenges relates to data management.
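To make the scale concrete, here is a rough back-of-envelope sketch of the volumes described above; the per-sample sizes are illustrative assumptions, not figures quoted in the talk.

```python
# Rough back-of-envelope sketch of the data volumes for 100,000 exomes.
# The per-sample sizes below are illustrative assumptions, not quoted figures.

RAW_EXOME_GB = 10        # assumed raw aligned reads (BAM) per exome
PHENOTYPE_MB = 1         # assumed structured phenotype record per individual
N_SAMPLES = 100_000

raw_total_tb = RAW_EXOME_GB * N_SAMPLES / 1_000          # ~1,000 TB, i.e. ~1 PB
phenotype_total_gb = PHENOTYPE_MB * N_SAMPLES / 1_000    # ~100 GB

print(f"Raw sequence data: ~{raw_total_tb:,.0f} TB (~{raw_total_tb / 1_000:.1f} PB)")
print(f"Phenotype data:    ~{phenotype_total_gb:,.0f} GB, "
      f"{phenotype_total_gb / (raw_total_tb * 1_000):.3%} of the total")
```

Even with generous assumptions for the phenotype record, it remains a tiny sliver of the total until longitudinal molecular phenotyping enters the picture.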
So here we have a petabyte of raw data, and one of the first things that we will want to do with that data, of course, is move it from one place to another. That turns out to be extremely difficult. Even in an era where generating 100,000 exomes is no longer inconceivable, actually moving that data is pretty tough. Even with a very high bandwidth connection, a 10 gigabit connection such as the one we use at the Broad for moving data around, shifting 100,000 exomes would take on the order of one to six months, depending on exactly how you do it. In fact, interestingly, it may actually be more time-effective to simply fill a truck full of one-terabyte hard drives, drive it to the facility, pay someone to sit there and copy the data onto those drives, load the truck back up and drive it back to wherever the data need to be delivered. It's somewhat ironic that we're living in this high-tech world and yet delivery of data by road is more cost-effective.

Data storage is not free. I've seen different numbers and different estimates of exactly how much it costs to store these data; it depends on whether they need to be accessible from high-performance disks, but it could cost up to a million dollars a year to keep 100,000 exomes on disk. These numbers will be roughly tenfold higher for whole genome sequence data. Importantly, though, we can drop these numbers substantially by applying compression algorithms, and I'll go through one such approach later. And although these numbers are daunting, the community now has very extensive experience in handling large data sets, not quite on this scale, but certainly in the tens of thousands of exomes. All of these logistical problems are soluble, I think eminently soluble, and most of them don't require developing fundamentally new methods.

The second set of logistical challenges relates to QC and metadata. One issue, of course, is keeping track of samples: making sure that the sample data you load at one end of the truck you've driven across the country is the same sample when you reach the other end. For genetic data this is in many ways easier, because the data are inherently identifying in a sense: it's relatively easy to spot sample swaps, duplicates, pedigree errors and so on. But some errors will remain invisible to genetic checks, and keeping track of phenotype data and ensuring that it's linked to the correct set of genetic data will require incredibly stringent quality control and sample tracking across that whole chain. There is also additional, non-phenotypic metadata that needs to be very carefully tracked. For instance, we need to know exactly what each participant has consented their data to be used for, so that in this vast aggregated theoretical data set the data are not accidentally used in ways that are not appropriate given the consent. We need to know, in case we need to do high-throughput validation or biological follow-up, exactly where the samples for those individuals are, and of course we need to know whether participants can be recontacted for phenotyping and, if so, how. And as I mentioned earlier, phenotype data is likely to increase massively in the near future, so it will become an increasingly large fraction of the informatics burden.

Now I'll spend a little bit of time talking about harmonisation, particularly in the sequence space.
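For what it's worth, the one-to-six-month figure is easy to reproduce with simple arithmetic; the sustained-throughput and drive-copy numbers below are assumptions for illustration, not measurements from the Broad.

```python
# A minimal sketch of the transfer-time arithmetic behind the truck comparison.
# The efficiency and disk-speed figures are assumptions, not measured values.

PETABYTE_BITS = 1e15 * 8      # ~1 PB of raw exome data, in bits
LINK_GBPS = 10                # nominal 10 gigabit/s connection
EFFICIENCY = 0.25             # assumed sustained fraction of nominal bandwidth

network_days = PETABYTE_BITS / (LINK_GBPS * 1e9 * EFFICIENCY) / 86_400
print(f"Network transfer: ~{network_days:.0f} days")     # roughly a month at 25% efficiency

# Versus filling ~1,000 one-terabyte drives at the source facility:
WRITE_MBPS = 100              # assumed sustained write speed per drive
PARALLEL = 50                 # assumed number of drives being filled at once
copy_days = (1e12 / (WRITE_MBPS * 1e6)) * 1_000 / PARALLEL / 86_400
print(f"Filling the truck: ~{copy_days:.1f} days of copying, plus the road trip")
```

With less favourable sustained throughput the network figure stretches toward the several-month end of the range mentioned above, which is what makes the truck look competitive.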
The need for harmonisation is driven by the fact that both sequence and phenotype data are generated inconsistently between studies, for various reasons, some of which are sane and some of which are not. This lack of consistency really hampers, and in some cases actually destroys, our ability to draw useful conclusions across studies. As Chris O'Donnell mentioned, it's very difficult to approach phenotype harmonisation using current methods, and as we aggregate larger and larger numbers of studies that will become an increasing bottleneck. It seems to me that the current approach may not be scalable to very large aggregated data sets, so we may end up having to move to approaches based on machine learning; while those will be noisier, that may be the only way we can actually pull these phenotypes together.

Fortunately, harmonisation is much more tractable when it comes to the sequence data, but the key point is that for harmonisation of sequence data to occur, data processing and variant calling have to be done in a centralised way. As long as we live in a world where genotypes are called at various different facilities and then pulled together, it will be effectively impossible to harmonise the data. The reason is shown in this animation. Let's say we have data from one study, where variant positions are shown as rows and individuals as columns; these are the sites where each of the individuals is variant across this region of the genome, and in this column I've shown in black the sites that are variant within this particular study. If we then draw in a set of individuals from a separate study, there are certain sites that are variable in both populations, and in many cases we can draw conclusions across the studies for those sites. But there are also a number of sites that are seen in one study but not in the other, and there are several possible explanations. For instance, if this is exome sequence data, it may well be that the exon containing that site was simply not covered in one study but was covered in the other, so there is missing data; and given the way that variant calls are currently stored, it's actually very difficult to extract that information. In some cases we'll see a false positive in one study but not in another, and in other cases there will be a false negative in one study but not the other.

The only way we can resolve these issues is to pull the samples together at the raw data stage and reprocess and recall the variants in a combined fashion. That allows us to determine precisely why a site is called in one study but not the other; it allows us to find new variants by sharing information across those studies, sites that are actually variable but weren't called in the initial analysis; and it also allows us, in some cases, to identify false positives. In this case, by sharing information across all of the samples together, we can reduce the evidence for a false-positive site in one of those studies.

Here I'm showing the pipeline for sequence data processing that's currently used at the Broad; the pipelines at many other centres look similar. I apologise that this is the sort of standard, overly complicated processing diagram.
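The ambiguity described here can be sketched with a toy example; the sites and studies below are entirely hypothetical, and this illustrates the problem with merging per-study call sets rather than the actual calling pipeline.

```python
# Toy illustration of why merging per-study call sets is ambiguous: a site
# called in one study but absent from the other could reflect missing coverage,
# a false positive, or a false negative, and the calls alone cannot tell us which.
# All sites and studies below are hypothetical.

study_a_calls = {"chr1:1000": "A>T", "chr1:2000": "G>C"}
study_b_calls = {"chr1:2000": "G>C", "chr1:3000": "C>A"}
study_b_covered = {"chr1:2000", "chr1:3000"}   # chr1:1000 was not captured in study B

for site in sorted(set(study_a_calls) | set(study_b_calls)):
    in_a, in_b = site in study_a_calls, site in study_b_calls
    if in_a and in_b:
        status = "variant in both studies: directly comparable"
    elif in_a and site not in study_b_covered:
        status = "no coverage in study B: missing data, not a real difference"
    else:
        status = "discordant: false positive or false negative; joint recalling needed"
    print(site, "->", status)
```

Joint recalling from the raw reads resolves the discordant cases because every sample then contributes evidence at every candidate site.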
The key features here are, firstly, that there is a per-sample step where data are processed, analysed and recalibrated, and then, crucially, there's a step in the middle, the variant calling step, where we batch together a large number of samples, between one and n samples, where n can be arbitrarily large, as I'll show in a second. The variant calling is then done by sharing information among all of these samples together, which allows us to get much more powerful estimates of which sites are variant and also to rule out particular error modes. Finally, in the third phase of the pipeline, we integrate all of the variants called from each of these individuals with external sources of data, recalibrate the variants, and that gives us our final data set of sequence variants. It's critical that the middle step is actually done on many different samples together.

The standard approach now at the Broad is to do this variant calling on 100 samples in one batch, but one of the questions we've been asking over the last six months is whether this can be extended to much larger sample sets. The only way this can be done is by reducing the amount of information that's present in each sample. Here you can see, for high-coverage genome sequence data, a read file: each of these orange horizontal lines is an individual read piled up across a region of the genome. In most regions of the genome, say this region here, the individual appears to be homozygous for the reference sequence at every position, and we can relatively confidently say this is a well-behaved region where the individual is homozygous reference, and collapse that region down into a single record that retains some quality information but discards a lot of the redundant information carried here. What is kept are only the reads that pile up around the variable positions where we see evidence for a heterozygous variant, or perhaps some more complex variant, in that individual. By discarding this information it's possible to compress the raw data by somewhere between 20 and 100 times, and that then makes it possible to scale variant calling across much larger numbers of samples.

So as a pilot analysis, to prove the principle of scaling this type of analysis up, I've been working with Mark DePristo and Khalid Shakir on a pilot test looking just at chromosome one, so about 10% of the genome, in 16,500 exomes. These exomes are derived from a number of different studies, including 1000 Genomes and a series of disease-specific studies, as well as healthy controls from each of those studies, and I've cited some of the PIs involved in this project here. For these 16,500 exomes, all of the data have been pulled together, recalibrated jointly, and then variant calling has been run across the whole set. Functional annotation has been performed using a pipeline developed in my group, though I won't present those data in this presentation. The preliminary results firstly, and most importantly, demonstrate that this is feasible: the analysis on 16,500 exomes worked for chromosome one and will therefore work across the rest of the genome. It took just under a week to run, using relatively modest computational power, at least by Broad standards. Modest, of course, is a relative term.
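The collapsing step can be sketched roughly as below; this is a simplified illustration of the idea of keeping full reads only near candidate variants and summarising confident homozygous-reference runs, not the Broad's actual reduced-read implementation.

```python
# Simplified sketch of the compression idea: summarise confident homozygous-
# reference runs as single blocks and keep full read data only near candidate
# variant sites. Illustrative only; not the actual reduced-read algorithm.

def reduce_pileup(depth, variant_sites, window=10):
    """Collapse a per-base pileup into blocks: 'ref_block' spans summarised by
    their minimum depth, and 'full_reads' spans retained around variant sites."""
    keep = set()
    for site in variant_sites:
        keep.update(range(max(0, site - window), site + window + 1))

    blocks, start, in_keep = [], 0, (0 in keep)
    for pos in range(1, len(depth) + 1):
        if pos == len(depth) or (pos in keep) != in_keep:
            kind = "full_reads" if in_keep else "ref_block"
            blocks.append((kind, start, pos - 1, min(depth[start:pos])))
            start, in_keep = pos, not in_keep
    return blocks

# Example: a 200 bp region with one candidate heterozygous site at position 120.
coverage = [40] * 200
print(reduce_pileup(coverage, variant_sites=[120]))
# [('ref_block', 0, 109, 40), ('full_reads', 110, 130, 40), ('ref_block', 131, 199, 40)]
```

Because most of a high-coverage sample is confidently homozygous reference, collapsing those runs is what yields the 20- to 100-fold reduction described above.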
So that means scaling this approach up to the whole genome level, and to much larger numbers of samples, is entirely feasible. In terms of looking at it on a larger scale, we're currently preparing for chromosome one calling in 20,000 individuals, adding in a few more thousand individuals from other studies. That will be followed by large-scale validation and the design of cheap genotyping arrays, in this case to target loss-of-function variants identified in these 20,000 exomes, which again I won't talk about more, but it ties into Francis Collins's idea of a human knockout project. The key point here is that there are no fundamental technical barriers to scaling this up to large numbers, although one challenge we will face is that the diversity of sequencing platforms may soon increase beyond the current Illumina platform.

I realise I'm running low on time, so I'll just skim very briefly through the analytical challenges. Many of these have already been discussed by Peter Donnelly. He mentioned, and I think this is crucial to understand, that variant calling is still immature. I'll skip the other issues here, except to point out again, as I mentioned earlier, that having very broad phenotype data will impose a major multiple-testing burden that will need to be taken into account when designing studies.

So a key challenge that I wanted to spend the last section of my talk on is how we can actually provide useful access to these data. Excuse me, you have one minute. One minute, okay, so I will make this very quick. Providing aggregated and harmonised variant calls will of course greatly empower statistical geneticists, but in terms of serving the rest of the research community we need to consider how we can provide them with tools that allow them to tackle typical use cases. For instance, they may wish to know which missense or loss-of-function variants are found in their favourite genes and what phenotypes they're associated with, and of course, from a clinical perspective, which variants in a patient's genome, or indeed in their own genome for a consumer, are actually associated with disease. And I will just skip over this. Teri Manolio has circulated some of the results from the meeting held earlier this month, where various different models for data access and aggregation were presented. I think going through that document is very useful. There are many different models that can be provided, and I think the solution here will be to use all of them, not just one.

So the key messages, in my final 15 seconds, are that very large scale aggregation of sequence and phenotype data is entirely tractable from a logistical perspective. It will require centralised processing and variant calling, but this is certainly doable. The amount of phenotype data on samples will increase massively as we move into much more complex longitudinal phenotyping, and harmonisation will of course be much more challenging there than it is for sequence data. I mentioned the curse of multiple testing, and finally the need for substantial investment in new interfaces to maximise the impact of aggregating this sort of data on the broader biomedical community. Thanks.

Great, thank you. Comments and questions for Daniel. How many times does a genome have to be sequenced before you can say that you know what is in there, or do you ever know?
I don't think you ever know for sure. In particular, for more exotic variants, very complex regions where you have multiple different rearrangements nested within each other, at the moment there's no method that will actually detect those, and it's likely that over the next three to twelve months we'll develop new methods for doing that. I think at each stage we'll just need to go back, revisit these samples and recall them over and over. One of the challenges we'll face, of course, is that if we have N samples within our data set and we then need to add another ten, then in an ideal world, just to add those ten and get the best possible variant calls for them, we need to recall the entire batch together. So we've been discussing at the Broad the idea of a monthly recalling of almost all of the exomes held at the centre, which would allow us to keep recalling those samples and improving the calls as each new batch is added.

But you must get to a point where there are some exomes that just never change when they go through that process, and at some point you can stop doing that.

Possibly, that's right. It will be incremental, and I think we'll probably reach a point where you're close enough to reality that you can stop.

Okay, so we have a number of questions: Maynard, Steven, Thomas, Trisha and Gail.

Excellent presentation. I think you struck the right balance between the sort of alarmism that sometimes hovers around this area and naivete. Obviously there are some challenges, but they look solvable, and I would just say that although I can certainly understand why at the moment this kind of centralised variant calling is essential, I'm more optimistic that this is a transient phenomenon. There was a very similar situation in the early days of, let's say, cosmid or BAC-level assembly, in which unless it was all done at one place you just got different results on the same BAC, and so forth, but those problems were solved. It requires collective activity in which multiple centres are exchanging data sets and comparing their analyses, but that, I think, is the path forward. We don't want to over-centralise anything here.

I think that's fair. It may well be that in one or two years' time the procedures are sufficiently robust and well established that we can just distribute the code required to call variants, and that is then done systematically. And of course, as the data quality improves, that will make a big difference as well: if we have long reads that reduce mapping errors, and very accurate reads, that will reduce the impact of different processing pipelines on the final calls.

And you correctly identify, I believe, that the key threat to that evolution is diversification of platforms. The only comment I would make there is that there tends to be platform convergence for relatively long periods of time, just because it's so beneficial. It's like the success of DOS: not necessarily the best product, but there are just a lot of advantages to many people using the same product, and I think that will happen here.

I suspect we're just about to go through a period of disruption, but things will stabilise. Sure.

That was a wonderful presentation.
I just want to follow up and ask you to speculate for a minute or so on an issue related to something we talked about last night and this morning, which is the sort of exponential growth in the kinds of tests and the ways in which we will want to look at a very large data set of whole genomes with a large number of phenotypes, potentially doing phenome scans as opposed to genome scans. How do you foresee the next two to three years in terms of the computational capability to do that? Is it going to be so large and cumbersome? You just described how much effort it took to do one chromosome: one week, with a very distinguished group of people spending a lot of time converging on that. At some point this will all be trivial, but that may be 5, 10 or 15 years from now, and if we do these kinds of large-scale sequencing projects there is an expectation of getting results right away, not of saying it will take us three years to analyse what we want, because as the sequence data come off we want to start looking right away. So can you speak to that problem of only a few people being able to do this, as opposed to many, and how long it will take to transition to having many people able to do it?

My sense is that it's incremental. The exome sequencing that we do now is far from perfect; Peter can certainly testify to the fact that small insertions and deletions are still called really quite badly from exome data. That's improving and will be much better in one or two years' time, but even with the data quality we currently have, it's still possible to use exome data in very careful ways to get clean answers for specific questions, and in the rare disease space in particular it's been incredibly effective, even though we know we're missing some variants. So I see it as an incremental process. There is some low-hanging fruit that we can address with existing methods, and as we refine the methods and increase the complexity of the phenotype data and the accuracy with which we call sequence variants, we'll be able to move higher up the tree to the higher-hanging fruit. I don't think there's a sense in which we need to wait. We can get quick wins now with the methods we currently have, and then incrementally progress from there.

I'm still blown away by the sheer volume of data and the complexity of handling it. You said these challenges will be solved, but can you speculate on how we avoid the truck, how we can handle these problems without needing a truck to move the data?

Oh I see, the transfer problem. It's a big problem. There is certainly a huge amount of effort currently underway to increase bandwidth, and again I think this is a problem that in a year or two may get fixed. In other countries, which have better underlying networks, it is to some extent already solved, but...

So is there a technological path forward, or does it need a disruptive technology before we can do it?

As I say, this is a field I don't know incredibly well, but my understanding from speaking to people is that there is a path being followed, and certainly there's a huge amount of investment currently being pushed into this type of area. Genomics is just one area where moving lots of data around is important.
There are areas that are far more commercially important where moving lots of data around is critical, so I think that will drive innovation pretty rapidly. But for the kind of project we're discussing, this is very relevant to think about.

It's probably worth adding that it's the raw sequence data which is huge. Most users would be happy with just the variant calls, which are much, much smaller files.

That's right, and that's a critical point. The truck that we have to drive across the country is only for the raw data, which then needs to be processed into analysis files that are much easier to move. In the context of the 1000 Genomes Project, very few people would download the raw sequence data from that project, but there's been very wide utilisation of the variant calls that have emerged from it, which amount to a few gigabytes, or a couple of hundred gigabytes.

And then Gail. You told us last night to start worrying about multiple comparisons on the phenotype side, so I did. Maybe, here's a question, maybe it's not as bad as it looks at first blush, because at least on the phenotype side, if you think of all of the conditions we have discussed this morning, it's not like a SNP here and a SNP there: there are relationships. So don't you think that if we use family-wise approaches and hierarchical methods, it's not going to be an overwhelming problem? A disorder of the heart is related to another disorder of the heart more than to unrelated conditions. So are you really scared? Because it seems to me a tractable problem.

Well, it depends. There are certainly many correlations between these phenotypes, so an appropriate correction method will take that correlation structure into account, and it's not as though we'll suddenly be doing a stringent Bonferroni correction across every possible phenotype that we look at. But even so, humans are just fundamentally incredibly complex creatures; there are lots of different things going on with us that need to be measured, and they aren't necessarily all just aspects of a few underlying phenotypes. So I think particularly as we start thinking about things like longitudinal RNA sequencing data or metagenomic data, very complex data sets that in and of themselves constitute many, many different measurements, that's where multiple testing starts to spiral out of control incredibly quickly. And again, it's not intractable; it just means it will be very hard to use these cohorts for discovery. They will still be incredibly powerful from a validation perspective: for going in and saying, I have found a variant in another cohort that I think is associated, what is an unbiased estimate of the effect size of that variant? That's where these cohorts are incredibly useful.

I could go on at great length, but I won't; I'll try to be brief. There's a lot of muddled thinking about multiple testing. My own view is that the classical statistical way of thinking about it is fine for many purposes, but in this specific discussion it just seems bonkers to me that if we've got genome-wide data on some individuals, and I'm doing a study on heart disease as the phenotype and Rick is doing a study on some cancer as a phenotype, somehow or other the fact that he's doing a study means that I have to use different p-value thresholds. It's completely bonkers. There's a real danger that we'll let the multiple-comparison tail wag the dog, and so on.

Daniel, thanks, it was a great talk.
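On the correlation point, one common way to let the multiple-testing burden reflect correlated phenotypes is to estimate an effective number of independent tests from the eigenvalues of the phenotype correlation matrix (in the spirit of Gao et al. 2008); the sketch below, with simulated data, is one possible approach rather than anything endorsed in the discussion.

```python
# Sketch: estimate an effective number of independent tests for correlated
# phenotypes from the eigenvalues of their correlation matrix (a Gao-style
# approach). Simulated data; the thresholds and structure are assumptions.

import numpy as np

def effective_tests(phenotypes, var_explained=0.995):
    """Number of principal components needed to explain `var_explained`
    of the total variance of the phenotype correlation matrix."""
    corr = np.corrcoef(phenotypes, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
    cum = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cum, var_explained) + 1)

rng = np.random.default_rng(0)
base = rng.normal(size=(5_000, 10))                              # 10 underlying traits
noisy_copies = [base + 0.05 * rng.normal(size=base.shape) for _ in range(5)]
phenos = np.hstack(noisy_copies)                                 # 50 highly correlated phenotypes

m_eff = effective_tests(phenos)
print(f"50 phenotypes -> ~{m_eff} effective tests; "
      f"use 0.05/{m_eff} rather than 0.05/50 as the significance threshold")
```

The point mirrors the answer above: corrections should follow the correlation structure of what is actually measured, not the raw count of phenotype columns.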
I was wondering if you could address the current limitations of sequencing that we should really care about, like HLA or trinucleotide repeats, and whether some of those things are going to get fixed, what's unfixable, and what the timeline is.

So HLA is a nightmarish region of the genome that most of us try not to venture into if we can avoid it, but there are certainly people working on it. There are particular regions of the genome that are nasty, and in some cases I think methods will be developed, particularly improvements to the reference sequence, that will make them a bit more tractable; in other cases there are regions that are just so repetitive that, with the current length of reads, it will be impossible to sequence through them. There are particular classes of variants that we're still not very good at calling: small insertions and deletions are still a challenge, particularly in the medium size range, but that is definitely tractable. In fact, I think we now understand the causes of the problem to a much better extent, and there's an issue associated with poor modelling of the errors that occur there. I'd be interested to get Peter's take on this, but I expect that we'll be doing far better on calling of small indels in the next six to twelve months. Absolutely. And then there are other, more complex classes, larger insertions and deletions, that we're still not very good at, but again I think those will improve with longer reads, better technology and some algorithmic improvements.

So HLA is quite important for us, so I'm curious, besides the not wanting to venture in part, what are the potential solutions for that region?

I don't know, basically, because I've tried to avoid looking at it, but I know there are people working on it.

It's tough. You need very long reads, and even that doesn't really help. We've been doing some 454 sequencing just in HLA-B, and it's really very, very messy. I think PacBio or one of these much longer read, single-molecule technologies is really what we're going to need, and those aren't ready for prime time yet. You can't do it with SNPs and you can't do it with short reads; you can't make sense of B and C.

So my suggestion is that the HLA people get together at the break. Thank you, Daniel, and you can pass the slide exchanger to the left. Our next speaker is Nancy Cox. Thank you.