All right, so I was tasked with going through some of the challenges and opportunities that would come from processing sequence data across different projects, and I just want to summarize some of the discussions we've had around this topic. I guess the first slide is just to say why. The point is that there are many analyses one can imagine that would benefit from combining information across different sequencing projects. At a high level, some examples: you could imagine that if you have different studies that include the same traits, doing a meta-analysis across those studies will nearly always be more informative than analyzing a single sample or a single one of those studies. As we try to explore the role of rare variants, it becomes really important to have very many controls; I'll show just a small example in a couple of slides. And obviously we could say that every single project should type thousands of controls, but I think there's also probably room to say that if we can coordinate analysis across different projects, we can extract many of these benefits in a less expensive way. As you imagine a situation where there will be tens of thousands and eventually hundreds of thousands of sequenced human genomes, you can imagine that this will provide very high-resolution information about the role of natural selection. You might search through even quite small functional elements in the genome and ask which ones never vary, or only very rarely vary, when you look at hundreds of thousands of individuals. And I guess one challenge is that even modest differences in how sequences are analyzed in each project can make some of these analyses a bit more difficult. So before discussing the options in a little more detail, I just wanted to give two very small examples. There's a gene called complement factor H (CFH) that's been associated with macular degeneration in many studies, or actually, the region that includes the gene has been associated in many studies. And earlier this year, Soumya Raychaudhuri, working with some of the people in this room, showed that a particular rare variant in the gene was very strongly associated with the risk of macular degeneration. It turns out that we had some sequence data for the region, so it was interesting to see if we'd be able to recapitulate this finding in our own data. Our own data included about 2,400 AMD cases and 800 controls. And if we look, we see that this exact same rare variant that was described is seen in 23 cases and zero controls. So that seems very appealing, right? But it turns out that in this particular dataset, because there are so few controls, the p-value is not so exciting. If you had sequenced many genes, or even if you just have many variants in each gene, this doesn't really stand out. It's only in the top 0.3% or so of variants, and there are thousands of variants. Now, if you were able to look at data from other sequencing projects, for example, if we checked what happens in the 12,000 exomes that were used to design the exome chip, then you see that in those 12,000 individuals, that variant is seen only twice. So seeing it 23 times in roughly 2,400 disease cases becomes quite impressive, right?
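To make the statistical point concrete, here is a minimal sketch of the two comparisons just described, using Fisher's exact test in Python with scipy. The counts are the ones quoted above; the choice of test and the exact table layout are my assumptions, not a description of the original analysis.

```python
from scipy.stats import fisher_exact

# Counts quoted in the talk: the rare CFH variant is seen in 23 of
# ~2,400 AMD cases and in 0 of ~800 internal controls.
cases, controls = 2400, 800
table_internal = [[23, cases - 23],    # carriers vs non-carriers among cases
                  [0, controls - 0]]   # carriers vs non-carriers among controls
print(fisher_exact(table_internal, alternative="greater"))
# With so few controls the p-value is only ~1e-3: in the top fraction
# of a percent, but unremarkable among thousands of tested variants.

# Same comparison against the 12,000 exomes used to design the exome
# chip, where the variant was reportedly seen only twice.
table_external = [[23, cases - 23],
                  [2, 12000 - 2]]
print(fisher_exact(table_external, alternative="greater"))
# Now the enrichment in cases is dramatic.
```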
So, obviously, you'd like to do this in a systematic way. In this case we already knew the answer, because Raychaudhuri and colleagues did a very nice analysis, but there are probably more variants like this to be found that will inform the function of other genes. Okay, so that's one side of the coin, why we'd like to do it. The other side of the coin is that there are challenges if you don't do this in a careful way. Many of you have seen QQ plots for all sorts of comparisons; it's common to show them in genome-wide association studies. One of the things you'd like to see in these plots (I apologize if they're very faint) is that variants fall close to the 45-degree line for most of the variation. In this particular slide, we're comparing exome sequence data generated at two different centers in a particular large project. There's a set of variants that were judged to be of lower quality, and those show many differences between centers. But even something like half a percent of the high-quality variants are very different between these two centers if you take the initial calls. If you call them jointly and spend a bit of time filtering them, then you get an analysis that falls much closer to that 45-degree line, where there are few spurious differences (a rough sketch of this kind of plot follows at the end of this passage). Obviously, being sequenced at one center or another in this particular project was not a phenotype. So if we just said, let's compare results between studies, probably most of the differences, most of the things that would be highlighted, would be dominated by data processing artifacts. Okay, so what are the options in trying to combine data across studies? There are several. You could imagine taking the simplest option, where each project decides how best to analyze its own data. There are certain virtues to that approach: if you spend time generating the data and thinking through it for many months, you can probably do a very good job of calling it, maybe better than someone running a central analysis pipeline could do if they've only just had a first look at your data. Now, even for this to have benefits, I think we need to spend time making sure that data is in standard formats and is stored in ways that are queryable and that facilitate analysis across studies. There's some opportunity to do these analyses in a more planned way. We might imagine having minimum standards for what we would like to see deposited; we'd like to see certain kinds of variants analyzed and stored, for example. We could even imagine making these standards quite strict and specifying that certain analysis tools should be used, which would have the virtue of making these datasets more similar. You could then imagine yet another layer, which would be to actually recall all the studies, or certain groups of studies, centrally, and this would make the analysis the most uniform. So if we start with option one, you could say, well, let's just use the calls provided by each project. Even though this might not sound super appealing, certainly in a workshop titled "what kind of central analysis can we do with sequence data?", I think there are many valuable analyses that are actually relatively robust to these differences.
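As a rough illustration of the QQ-plot idea above (not the actual plot from the slide), here is a sketch that compares observed p-values from a center-versus-center test against the uniform expectation. The simulated numbers, 100,000 variants with roughly half a percent behaving as artifacts, are loosely patterned on the fractions mentioned and are my assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def qq_plot(pvalues, label):
    """Scatter observed vs expected -log10(p); null data hugs the diagonal."""
    p = np.sort(np.asarray(pvalues))                       # ascending
    n = len(p)
    expected = -np.log10((np.arange(1, n + 1) - 0.5) / n)  # matching quantiles
    observed = -np.log10(p)
    plt.plot(expected, observed, ".", markersize=2, label=label)

rng = np.random.default_rng(0)
# Jointly called and filtered: center of origin is not a phenotype,
# so p-values should be uniform.
p_clean = rng.uniform(size=100_000)
# Separate initial calls: ~0.5% of "high-quality" variants differ
# sharply between centers.
p_raw = np.concatenate([rng.uniform(size=99_500),
                        10.0 ** -rng.uniform(2, 10, size=500)])

qq_plot(p_clean, "joint calls, filtered")
qq_plot(p_raw, "separate initial calls")
plt.plot([0, 10], [0, 10], "k--")  # the 45-degree line
plt.xlabel("expected -log10(p)")
plt.ylabel("observed -log10(p)")
plt.legend()
plt.show()
```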
If, for example, we decided to focus on meta-analysis, where we analyze each study independently and then combine results across studies, those analyses can actually be very robust to which variants are called in each study, and to whether some study had more or fewer false positives in its list of variants (a minimal sketch of this kind of meta-analysis appears at the end of this passage). And to be fair, most of the benefits from the ability to do these joint analyses we're not realizing yet, not just for sequence data, but even for genome-wide association studies. It's still quite a cumbersome process to combine results across a few studies. Even if you say, I'm going to have low ambition, I'm just going to analyze each study and do meta-analyses across many studies, there are many things that would be nice to see happen to make this more practical. These range from harmonizing phenotypes to making sure that data formats are standard and consistently used. Every time you decide to tweak things a little bit for an individual study, for example in how you lay out the data in dbGaP, it means that any analysis that tries to combine those datasets now requires manual intervention; it can't just automatically scan across different datasets, even if you have access to several of them, and try to combine results. Obviously, we talked a lot yesterday about streamlining data access protocols, and I think that's also very important. So all these things, how we store phenotypes, how we use standard formats, data access: basically, there are many things to do here that have nothing to do with centrally processing data and that have potentially very large payoffs in terms of making it easier to combine data across studies. Then there are differences between studies that we could probably try to reduce, and I think this could be helpful. You could imagine that when data is centrally deposited, you could define some minimum standards. This could be, say, that we'd like to see analysis of indels in addition to SNPs, or of certain types of copy number variation. One important thing is that many sequencing studies nowadays report variants that are discovered, but it's very hard to interpret the absence of a variant. Is it because the region wasn't well sequenced, or is it because the region was very well sequenced and nothing was there? When you think about using samples sequenced by other projects as controls for your samples, it's really important to have this detailed annotation of what could have been discovered in the second project. You could even imagine requiring that all data that's deposited be processed at least once with the same set of tools. In my view, even if we spend time making these things happen, I see the improvements as actually somewhat incremental relative to option one. If each dataset is processed separately, even if it's processed with the same tools, there will still be important differences between datasets that make it impossible to directly combine data; you're still limited to meta-analysis kinds of approaches. For example, it's common now in sequencing projects to use a series of filters to remove variants that are likely to be artifacts. Those filters depend quite a bit on things like sample size, so if you have a large or a small project, the quality of variant calls is probably going to be quite different, even if you analyze them with the exact same tools.
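For the "analyze each study independently, then combine" route, here is a minimal sketch of a standard inverse-variance fixed-effects meta-analysis. This is the generic textbook method, not any specific project's pipeline, and the per-study effect sizes below are invented for illustration.

```python
import numpy as np
from scipy.stats import norm

def fixed_effects_meta(betas, ses):
    """Inverse-variance-weighted fixed-effects meta-analysis.

    Only per-study summaries (effect estimate and standard error)
    cross study boundaries, which is why this approach is robust to
    each study's own calling and filtering choices.
    """
    betas = np.asarray(betas, dtype=float)
    ses = np.asarray(ses, dtype=float)
    w = 1.0 / ses**2                       # weight = 1 / variance
    beta = np.sum(w * betas) / np.sum(w)   # pooled effect
    se = np.sqrt(1.0 / np.sum(w))          # pooled standard error
    z = beta / se
    p = 2.0 * norm.sf(abs(z))              # two-sided p-value
    return beta, se, p

# Hypothetical summaries for one variant tested in three studies.
print(fixed_effects_meta(betas=[0.12, 0.08, 0.15], ses=[0.05, 0.06, 0.07]))
```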
Option three: what if you try to jointly analyze data across many projects? Obviously, this would be the most compute- and labor-intensive, because it requires things to happen that wouldn't happen by default. There are many analyses you could imagine that benefit from larger sample sizes. If you're trying to discover new variants, having many more individuals sequenced at a particular position makes it easier to discover variants whose evidence in any single sample might have been marginal. If you're trying to resolve complex events, breakpoints for structural variation, and things like that, it's really, really helpful to have sequence information across many samples. The same holds if you want to resolve haplotypes, or if you want to decide whether a variant is real or an artifact. If you're trying to decide that a variant doesn't smell right because whenever you see support for it, it's always on the same strand, in any one study it's hard to be sure; it could happen for any variant in any one study. But if you see the variant in many studies, and the support is always on a particular strand, then it becomes quite dubious that it's real (a small sketch of this strand check follows at the end of this passage). And if you're able to process data centrally, when a new analysis becomes possible, or a new set of analysis tools appears, you can allow those benefits to percolate to data that was generated previously across studies. I think right now we're actually in the situation where it's technically feasible to call tens of thousands of samples. Mark outlined how they've recently done this for 16,000 exomes; I know internally we've called something like eight or nine thousand exomes at a time. So we're very close to those numbers, and I think this is really feasible, especially if we're willing to be happy with 80% solutions. When you imagine, for example, calling data that's been deposited across many projects, there are always issues with dealing with legacy data. If you say, ah, I need to include all the data, that might mean you need to include data from SOLiD, a platform that's now obsolete, or nearly so, and that no one is developing for, or that you need to support very short reads, only 30 bases or so, that are now no longer generated. So if you're willing to set aside big swaths of the data, that greatly reduces the complexity of the problem, and then this is probably quite doable. To make it doable, you really need to spend time on how it's possible to access sequence data across many studies; if these analyses require manual intervention, they become much, much harder. And we need to be realistic about the challenges of dealing with corner cases. Are we willing to say that we will drop many types of legacy data because the reads are too short, because the error rates are a bit higher, or because they use platforms that are not being actively developed for? I think quality control for these kinds of analyses is also quite important. One experience that many people in the room are aware of: in the 1000 Genomes Project, a few months ago we found out that there was a set of five or six samples whose sequence data had somewhat lower quality, and these accounted for 5% to 10% of all the indel calls in the project. So a very small number of samples with slightly lower data quality, until we had the right filters in place to pick them out, were distorting results for the whole project by quite a bit.
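Here is a small sketch of the strand check described above: within any one study the imbalance is easy to shrug off, but pooled across studies it looks systematic. The read counts are invented, and the two-sided binomial test is just one reasonable way to score the imbalance (scipy 1.7+ for binomtest).

```python
from scipy.stats import binomtest

def strand_balance_p(fwd, rev):
    """Two-sided binomial test that supporting reads split evenly
    between strands, as expected for a real variant."""
    return binomtest(fwd, fwd + rev, p=0.5).pvalue

# Hypothetical (forward, reverse) read support for one candidate
# variant in three independently sequenced studies.
per_study = [(9, 1), (7, 0), (11, 2)]

# In any one study it's hard to be sure...
print([round(strand_balance_p(f, r), 4) for f, r in per_study])

# ...but pooling the evidence across studies, 27 forward vs 3 reverse,
# makes it quite dubious that the variant is real.
fwd = sum(f for f, _ in per_study)
rev = sum(r for _, r in per_study)
print(strand_balance_p(fwd, rev))
```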
I think related to these ideas, and something that's worth thinking about, is that there are certain types of information, allele frequencies are one example, but there are a few other types as well, that can allow many of the benefits of joint calling without requiring you to share raw sequence data. You could imagine having some distilled view of the haplotype structure of different samples you've sequenced before, such that if you now sequence a new sample and try to place it within that haplotype structure, you'll be able to call it much better without actually having to access the raw data for all the previous samples. You could imagine that at each base you annotate what your current evidence is for each possible variant at that site across many thousands or tens of thousands of samples. Now, when I look at data from a new project, I can use that as my prior for variation, and that will improve the quality of my calling even if I don't look at the raw sequence data from all those previous projects. I think the risks of sharing these derivatives are similar to those involved in sharing allele frequencies, and if we do move ahead with the idea that it's OK to share allele frequencies for very large sample sets, we should explore what sorts of derivatives can be generated and easily shared to facilitate analysis of new sequence data. OK, I think this is my last slide. I'd expect that all these options are actually likely to be tried first, not in a central repository setting, but probably by investigators who have some shared scientific interests. If you have a group of investigators studying, let's say, the genetics of lipid levels or diabetes or schizophrenia, they have a very strong incentive to say, hey, what happens if we combine data across our parallel projects that all have information about this trait? They have a strong incentive to ask: what happens if we just take the raw calls? What happens if we try to analyze the data in parallel ways? What happens if we actually put the data in one place and analyze it together? And I expect that that's where we're going to see these analyses piloted first. Currently, even if you look at the simple things that can be done with meta-analysis, we're not really exploiting what can be done with data that's, for example, deposited in dbGaP. There's probably not that much sequence data there yet; lots of this data is still being generated and in the process of being deposited and so on. But there's tons of data from genome-wide association studies, and even when those studies have traits in common, most of the time there has been no meta-analysis, even a cursory one, to say, what's our current state of knowledge for this particular trait? Now, I do think there are exciting things that can be done if we combine data across projects, and I think this is worth pursuing. So I think this is basically the summary that I had. Yeah, David? So I start to get the sense that data processing is almost like an art, right? If you do it and I do it, no matter how hard we try, we won't do it the same way; you may use different filters. And so while you say that on one path we will be getting to these centralized servers, another thing, a variation of the options you described, is pre-bundled packages of standards, in which you hit run and you get the same result, I get the same result.
And what that requires is putting together a hodgepodge of tools, getting the licenses together, and having a release, not of a genome, but of a variant calling approach. Has there been any conversation about not just standardizing in terms of a recipe, but standardizing in terms of a pre-packaged set of options and everything else that goes with it? Right, so I think conversations about that are important. It is hard to translate these calling platforms between centers or between different analysis sites, and that's one of the reasons many people think the cloud is attractive. If I can define some calling process on the cloud, I might have a virtual machine image of what's required to make those calls, and I can then share it, and you can plug in your data, pay for the compute time, and call it with the exact same process. I honestly think that right now, if you look at where we are, the biggest issues are not the differences in calling. There are many analyses we could do despite the differences in calling that right now don't happen because it's hard to access the data across projects and so on. If we just made sure that everyone called with the same set of tools, we would still have all those other issues to resolve. But I actually think that if we were resource-unlimited and had the software engineering bandwidth to produce tools of that sort, well, someone asked yesterday, does this have to be centralized or could it possibly be a distributed model? And I think what you're saying is right: in some sense, if you had tools that were sufficiently transportable that they could go to different people's settings, run on their computers, and be guaranteed to give the same output, you could do some of the processing in a distributed manner, provided you could, in particular, share the information on error modes and sites that you get from large collections, which is part of why the outcomes might otherwise differ. In other words, it might be that running the exact same software tools on ten batches of 1,000 samples does not give the same answer as running that same set of software tools on 10,000 samples, because of the borrowing of information. But nonetheless, I think there's an implementation challenge, as there is even for any given group creating one such environment: to package their tools so they can be guaranteed to run on anyone else's computers, and to have someone provide help desk support, so that when someone puts it on a machine that's not fast enough or not configured correctly, it still runs. It's not that it's not a good idea; I think it actually would be a better outcome. It's just, to my mind, that much harder to implement. I don't think it's that hard to put together the set of tools we use and tar them up; most people use Unix-based computers, and you could have input of a FASTQ and output of a VCF, by which you've at least stabilized the versions and some of the things used along the way to build an annotation. This gets complex because these tools typically are not designed to process one sample. The challenge is, if I have 10,000 samples to process, I'm going to process them in some clustering environment, and my clustering environments are very different from David's and from yours, and I have to coordinate how jobs are scheduled and how they're divided, and there are many moving parts. I think that's part of it.
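As a sketch of the "release of a variant calling approach" idea raised here: pin the tool versions and the reference, take a FASTQ in, produce a VCF out, and let a manifest rather than the operator decide what runs. The tool names and flags below are placeholders of mine, not any real pipeline's interface; the cluster scheduling problem just described is exactly what this sketch leaves out.

```python
import hashlib
import subprocess
from pathlib import Path

# Hypothetical frozen "calling release": tool versions and the
# reference bundle are pinned so the same FASTQ yields the same VCF
# on anyone's Unix machine. All names and flags are placeholders.
MANIFEST = {
    "release": "calling-release-1.0",
    "reference_sha1": "<sha1 of the pinned reference FASTA>",
    "steps": [
        ["aligner-1.4.2", "--fastq", "{fastq}", "--ref", "{ref}", "--out", "{bam}"],
        ["caller-2.0.0", "--bam", "{bam}", "--ref", "{ref}", "--out", "{vcf}"],
    ],
}

def run_release(fastq: Path, ref: Path, vcf: Path) -> None:
    # Refuse to run against anything but the pinned reference.
    digest = hashlib.sha1(ref.read_bytes()).hexdigest()
    if digest != MANIFEST["reference_sha1"]:
        raise RuntimeError("reference does not match the pinned release")
    names = {"fastq": fastq, "ref": ref, "bam": vcf.with_suffix(".bam"), "vcf": vcf}
    for step in MANIFEST["steps"]:
        cmd = [arg.format(**names) for arg in step]
        subprocess.run(cmd, check=True)  # fail loudly, never diverge silently
```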
Mark? Well, I think that this particular aspect of the implementation, the data processing, could be done in such a fashion, but I don't think that that's really responsive to the mission of this meeting, which is to create a repository whereby we can share data and benefit across studies. We could develop the types of clinical querying systems that benefit from tens of thousands of genomes having been sequenced and interpreted in one place, and centralized tools for the community of people that aren't us, who couldn't possibly do all that data processing themselves, but really need some centralized tools to access the fruits of our labors. So, do you get requests for interpretation of individual genomes in clinical or research applications that you are fitting into your process and using your variant callers for, Gonçalo? We do get such requests. I would say that it's really not my area of expertise, and we have limited ability to answer those kinds of requests. But I guess the more general question is, is the aggregation you're talking about directed towards that problem in any way, or should we just set it aside if we're going to start to think about that problem? Okay, so I think if you're able to aggregate many genomes and analyze them systematically, it informs many problems. The problems I highlighted at the beginning were mostly research problems: how you discover a variant that's associated with a particular trait, or how you discover regions of the genome under selection. But I think it also greatly informs the annotation of individual genomes. It's different if you see a variant and know that it falls in a gene that commonly varies with similar types of variants, versus knowing that your variant is unique for that gene or for that functional element. But interpreting an individual genome is, I think, incredibly challenging right now. I do spend a little bit of time talking with people who do this for a living, and I would say that for most kinds of things, and for most genomes, even when they know there's a Mendelian disorder in the family, it's not clear how you interpret what you see. I think that's not really responsive, in some ways, to Richard's question. It may be that today it's challenging to do, and maybe everyone in the room has different points of view, but I would say that at least in some cases, yes, this is related to that challenge. People do ask for that, and if we think about how to create such a system, it's going to have to involve how you compare that genome to other genomes and to the reference knowledge bases that you have. So, Mark and Carlos. I just think we have been doing that, Richard, with the Mendelian families and with some individual cases that colleagues at Mass General have sequenced, in terms of combining and reusing Mark's tools in the system that he described to reanalyze those exomes, and we have gotten really helpful results on a technical level that we then carry forward into those types of analyses. So I think this is something we should be thinking about. Carlos.
So, just two points. One, I guess in response to Richard's question: it's certainly possible that a central analysis server could serve that function, where you act as a broker for people who have clinical data. You could query the database, has anyone seen this variant, and you could even get information on what study the variant was seen in, so you could potentially make contact, for example if you have a clinical reference laboratory that could be connected in that way. And two, on the question of aggregating for pipeline analysis: in my mind it's similar to the early days of genotype calling, where if you wanted to call the rare variants you had to have enough samples to train the clustering, but eventually I think some of this will stabilize. Once you have a thousand sequenced genomes, in fact, I don't think there will be that much variation from analysis to analysis for the N plus 1 genome. So I think there's a sort of transition phase that we need to think about, but then longer term we have to move towards being able to do clinical interpretation in what will eventually probably be a HIPAA environment, so that people can get results back. And I think that gets to the fact that there are two issues. There's joint calling, and there's real value to that; this isn't taking away from that value. But there's also the aspect of smaller numbers of samples that need to be called. Yet it's still important to share those and essentially put them together, because I don't think people will redo joint analysis with every new sample. So there are two things I'm focusing on, without taking away from the importance or value of joint calling. One thing I think is worth pointing out: there are really two benefits to joint calling. You learn about the alleles from one sample and genotype them in another. But you also get enormous improvements in specificity, because for sites that are errorful in one sample, you often gain lots of power to identify them as errors when the same error is common across multiple samples. And that is actually what I worry about a lot with clinical sequencing. If you're doing single samples, then what you're really risking is not just missing some small number of variants, because you can counter that to some degree with depth; it's that you have systematic artifacts that can really only be removed by looking across many samples. A good example of this is cryptic duplications, which show up as Hardy-Weinberg disequilibrium across lots of samples; you could never know from a single sample that this is happening. So the other point, Carlos's point about the N plus 1 genome, is actually quite important, because if you do have 100,000 sequenced genomes, there's probably a set of summaries of those that you could imagine deriving that makes it easier to call the 100,001st genome almost as well as if you analyzed it jointly with all the other ones.
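To make the "summaries that let you call the N plus 1st genome" idea concrete, here is a minimal sketch in which a per-site allele frequency, distilled from previously shared genomes, acts as a Hardy-Weinberg prior when genotyping a new sample. The simple binomial read model, the fixed error rate, and the numbers are all simplifying assumptions of mine.

```python
import numpy as np
from scipy.stats import binom

def genotype_posterior(alt_reads, depth, alt_freq, error=0.01):
    """Posterior over genotypes (ref/ref, ref/alt, alt/alt) at one
    site in a new sample, using an allele frequency distilled from
    previous projects as the prior via Hardy-Weinberg proportions."""
    q = alt_freq
    prior = np.array([(1 - q) ** 2, 2 * q * (1 - q), q ** 2])
    # Expected fraction of ALT-supporting reads under each genotype.
    p_alt_read = np.array([error, 0.5, 1.0 - error])
    likelihood = binom.pmf(alt_reads, depth, p_alt_read)
    posterior = prior * likelihood
    return posterior / posterior.sum()

# 3 ALT reads out of 8 is ambiguous on its own. A frequency prior
# built from many shared genomes shifts the call without any raw
# reads from those projects changing hands.
print(genotype_posterior(3, 8, alt_freq=1e-4))  # rare in prior data: still ambiguous
print(genotype_posterior(3, 8, alt_freq=0.2))   # common in prior data: confident het
```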
Yeah, I just don't think we should assume that there are two modes, one where we do joint calling and one where we do one-off genomes, any more than, I mean, again, we live in a world where, like, five minutes after anything happens in the world, it's completely webcrawled to every one of us. There's no doubt that systems could be developed, again, whether we'll invest enough to develop them, such that maybe not with every N plus 1 genome, but every night or something like that, you figure out how to update the analyses so that you have the value as if you had done calling on all of them, having added each one in incrementally. Otherwise, I think we'll be delivering a much worse clinical product than the research product, because we won't be benefiting from all the things we've learned.