All right. So my task was to talk about analysis and tools and portals. So we're going to be shifting gears a little bit, because what I want to talk about is: given the sequences in a central place, variants re-called, aggregated metadata, phenotype data harmonized, what tools are needed to make the data useful for the community? I know the conversation tends to come back to what's possible given the data restrictions, but I think it's important to explore what's possible in an ideal world: who is our user base, what analyses we can do, and how to do this most effectively. So the first slide here is just an incomplete laundry list of the types of analysis we can do, given the variants. Steve already covered some of these. We can provide phased haplotypes, aggregate allele frequencies across populations and across the whole cohort, and calculate variant burdens in various ways. We can provide functional annotations at coding and non-coding parts of the genome, lots of functional analysis, including information from disease databases. We can redo GWAS analyses in various ways and go to higher-order analyses, systems biology approaches to network and pathway analysis, and so on. Some of these analyses are well defined, some are pretty open-ended; some are easy, some are very hard algorithmically; and accordingly, for some of them tools are available, while in other cases the tools are still emerging. Some require a lot of computation; some are very easy to do and don't carry a huge computational burden. In order to prioritize which of these analyses are important to us, the most important question is who we are trying to serve with all these analyses and the tools that we are developing. Because the analyses described in the previous slides clearly mean different things to different people, and people are interested in very different aspects of the data.
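[Editor's note] The first item on that laundry list, aggregating allele frequencies per population and across the whole cohort, can be sketched in a few lines. This is a minimal illustration with made-up sample names, population labels, and diploid genotypes, not project data or any project tool:

```python
# Sketch: aggregate alternate-allele frequencies per population and
# across the whole cohort, from diploid genotypes (0 = ref, 1 = alt).
# All names and genotypes below are illustrative.
from collections import defaultdict

def allele_frequencies(genotypes, populations):
    """genotypes: {sample: (allele1, allele2)}; populations: {sample: pop}."""
    alt = defaultdict(int)    # alt-allele counts per group
    total = defaultdict(int)  # total alleles per group
    for sample, (a1, a2) in genotypes.items():
        for group in (populations[sample], "ALL"):
            alt[group] += a1 + a2
            total[group] += 2  # diploid: two alleles per sample
    return {g: alt[g] / total[g] for g in total}

genotypes = {"s1": (0, 1), "s2": (1, 1), "s3": (0, 0), "s4": (0, 1)}
populations = {"s1": "EUR", "s2": "EUR", "s3": "AFR", "s4": "AFR"}
freqs = allele_frequencies(genotypes, populations)
# EUR: 3 of 4 alleles are alt; AFR: 1 of 4; ALL: 4 of 8
```

At scale the same aggregation runs over VCF genotype columns rather than dictionaries, but the counting logic is the same.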
So David Altshuler made the point abundantly that statisticians and tool developers are already very well served. People like me; we wouldn't have this project for people like us. Medical consortium project analysts and drug developers will benefit, but they are already fairly empowered. So really the target audience, in my opinion, again starting downstream from the variants, is biologists in small laboratories, students, post-docs, and clinicians who don't necessarily have the expertise and the resources to take advantage of the data. So in this vein, the question is how to make our analyses accessible. The first key aspect, I think, is that things have to be easy. Tools have to be easy to install and easy to use; they have to be intuitive, preferably fast, and if they're fast they can be interactive. I highlighted web-based, which is not the only way to accomplish this, but it's a very good way, and one for which modern technologies exist. HTML5 is making a real impact on being able to access even large amounts of data and locally render images with new web-based technologies. This is only one side. The other side is that even if these tools are very easy to install, there are other impediments for small users. Some of the larger analyses require huge computational and storage resources, but users will have access to basically laptops, resources on that scale. Using and running some of these tools requires groups of analysts who are highly skilled in bioinformatics and computation. What we want is for people whose expertise lies in other areas, in experimental biology, to be able to use the same tools. So with this in mind, what are the ways we can, and hopefully will in the future, provide the data and the analyses to the community? Raw data to download is sort of the baseline.
Everybody understands that; that's been the mode of operation, easiest to accomplish, hardest to deal with afterwards, because now you have to store the data, compute on it, install the tools, all the stuff I talked about. Query portals, viewers, data slicers: Steve covered that in great detail. I think that really is making an impact now. For example, the 1000 Genomes data is now available through these data slicers from EBI and from NCBI as well. I think that's a very good way to get the data to the users. Static variant annotations, pre-computed resources: again, Steve talked about that. That's also very important, but it's limited, because many of the analyses will require more dynamic content, and I'm going to come to that. So this fourth item, I think, is the way forward: you have central data, for example all the sequences from the medical projects, and you bring the tools next to them. Users can upload their own data, analyze it in conjunction with the central data set, and you provide an environment in which the tools can be run. I think this is ideal. In my opinion, this is where things are going. So, parsing out the areas where static analysis makes sense, there are a number of these. Clearly variants, variant annotations, sample genotypes: fairly simple data types that are easy to compute, and they're general. Phased haplotypes, a very well-defined data type again. Basic per-variant annotations, and I'm going to come back to this. Metadata, phenotype information: again, things that are static content. In order to make these available, you need better query tools, and again Steve covered this, I think, very thoughtfully. There's a great need for being able to query and browse phenotypes, and there's very little available in that area. You also need sophisticated viewers. One example is just looking at the variants. Current viewers are basically either genome browsers or sequence alignment viewers.
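[Editor's note] The data-slicer idea mentioned above, serving only the records in a requested region rather than the whole file, can be illustrated with a toy linear scan. Real slicers (the EBI/NCBI services, or tabix over a bgzipped VCF) use genomic indexes instead, and the record format here is a minimal stand-in for VCF:

```python
# Sketch of what a "data slicer" does: return only the variant records
# overlapping a requested region, so users never download the full file.
# Records are illustrative (chrom, pos, ref, alt) tuples; a real service
# would answer the same query from an index, not a scan.

def slice_region(records, chrom, start, end):
    """records: iterable of (chrom, pos, ref, alt); region is 1-based inclusive."""
    return [r for r in records if r[0] == chrom and start <= r[1] <= end]

records = [
    ("20", 100, "A", "G"),
    ("20", 250, "T", "C"),
    ("21", 120, "G", "A"),
]
sliced = slice_region(records, "20", 1, 200)
# Only the chr20:100 record falls inside 20:1-200.
```

The point of the slicer model is exactly this interface: the user states a region, the server does the lookup, and only a small slice crosses the network.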
But what's important about the variants is their structure and what they do to the genome. So here's an example, a sort of mock-up, where the structure of the variant is a deletion: it deletes part of the chromosome, and what it does to the genes is make a fusion gene. So this level of visualization and interaction, in my opinion, is essential. Also, if we were to provide haplotypes, how do you browse the haplotypes, and the phenotypes, as I talked about? So what data are more dynamic, such that it makes more sense to provide tools to analyze them dynamically and provide customized content? It turns out that re-mapping and variant calling for users' own data is one, and this is ongoing. There are a number of projects funded by NHGRI, the Broad has one, Gonsal and I are running one, that are aiming to make these tools available in such an environment, an integrated analysis environment for users. Imputation: we've been talking about the possibility of a central haplotype resource and associated computing infrastructure that allows imputation of individual data into the central haplotype resource. Customized functional annotations: these are not practical to pre-compute. You can't foresee every possible structural variation, genome rearrangement, or deletion. The aggregate effect of phased variants: this is a slide from Daniel MacArthur. We often see in the 1000 Genomes Project that consecutive indels, if considered in isolation, introduce frame shifts, but they are always observed in an individual in phase, in which case the reading frame is restored. So these are the kinds of things, again, that you can't really pre-compute; you need dynamic computation for these. It turns out that this category has the highest tool development cost, because you're not just making these tools available as command-line tools; there's more infrastructural development. Do we need one tool, or do we need multiple competing alternative tools?
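[Editor's note] The phased-indel effect described above comes down to simple arithmetic: an indel shifts the reading frame when its length change is not a multiple of three, and two indels on the same haplotype restore the frame when their net change is. A minimal sketch, with illustrative indel sizes:

```python
# Sketch of the phased-indel effect: each indel alone shifts the reading
# frame, but if both sit on one haplotype the net length change can be a
# multiple of 3 and the frame is restored. Indels are signed length
# changes (insertion positive, deletion negative); sizes are illustrative.

def shifts_frame(*indel_lengths):
    """True if the combined coding-length change breaks the reading frame."""
    return sum(indel_lengths) % 3 != 0

del_4bp, ins_1bp = -4, +1
assert shifts_frame(del_4bp)               # 4 bp deletion alone: frameshift
assert shifts_frame(ins_1bp)               # 1 bp insertion alone: frameshift
assert not shifts_frame(del_4bp, ins_1bp)  # in phase: net -3, frame restored
```

This is why the annotation can't be pre-computed per variant: the functional call depends on which variants co-occur on a haplotype, which you only know once the user's phased data is in hand.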
This is an example, and the specifics are not particularly important, but these were comparisons from the 1000 Genomes Project somewhere along the way, maybe a year ago. These comparisons are fairly common. We take the same exact dataset, process it with a number of different tools, in this case calling variants with some experimental tools on chromosome 20, and then we compare the results. And what's important is that if we use the results in combination, for example taking all the variants that are called by at least two different tools, the quality of the dataset improves. So the point here is that there are inherent advantages to having alternative tools available. Centralized or distributed development? A very important point was made yesterday, I think maybe Mike brought it up, that tool development is iterative. Once you get an answer, you want to ask a new question. It's not possible to say, okay, we're speccing out that this is the information we want to get at, and then two years down the line the tool appears and satisfies everybody. It is really an iterative, ongoing process which requires flexibility. So users are often better served by light, flexible tools for customized analysis than by large, more monolithic systems. For example, I'm very partial to skiing, and what you see here is all the available snow report apps on iTunes. I use some of them; some of them are great, some of them I really don't care for. But the point is that all of these are looking at the same data. And the ones that go the extra mile, that let people get at the reports and tell them, yes, this is great, come out, or no, that's not where we're going, those are the ones that survive. So providing tools in this way makes for a competitive environment where tools are driven by demand-side economics, as we heard from Lincoln yesterday. Who would develop the tools?
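[Editor's note] The consensus idea, keeping variants called by at least two tools, is a straightforward voting scheme. A minimal sketch, with made-up call sets keyed by (chrom, pos, alt):

```python
# Sketch of multi-caller consensus: keep variants supported by at least
# `min_support` of the call sets. Call sets below are illustrative.
from collections import Counter

def consensus(callsets, min_support=2):
    """callsets: list of sets of variant keys; returns the supported subset."""
    counts = Counter(v for calls in callsets for v in set(calls))
    return {v for v, n in counts.items() if n >= min_support}

caller_a = {("20", 100, "G"), ("20", 250, "C"), ("20", 300, "T")}
caller_b = {("20", 100, "G"), ("20", 250, "C")}
caller_c = {("20", 100, "G"), ("20", 999, "A")}
agreed = consensus([caller_a, caller_b, caller_c])
# The two variants seen by two or more callers survive; singletons drop out.
```

Raising `min_support` trades sensitivity for specificity, which is the practical knob in these comparisons.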
In our experience in the 1000 Genomes Project, which in some ways we can look at as a large tool development project, we see a balance of tools coming from larger groups, large genome centers, and also a large cottage industry of successful tool developers. And what we find is that even fairly small informatics groups can produce very sophisticated, high-performance software and respond nimbly to user needs. So, last slide before we get into the discussion. If I look into my crystal ball, I would say we need to focus on the cloud, because that's the right combination of things we need. Not necessarily for providing big central analyses, but for the biologist who doesn't have the resources, because it gives you all the components. I think we should build an open environment for tool deployment to pull in the widest possible developer base, let tools compete with each other, and let these tools follow demand. And this is not impossible; that's where technology is going. Models exist, for example the iPhone apps that everybody is using. And with that, I'll open up for discussion. Yes. Excellent presentation. I really like the analogy of the app world. My question in that case is: where is the platform? Is there one platform or multiple platforms, and who is developing those platforms? Because you say iPhone; I think iPhone is in this case a platform. Who will be the genomic platform in this case, and is it one or multiple? I don't necessarily think that this would be iPhone applications where we analyze genomic data, although you never know; there might be some questions where that's perfectly sufficient. David. Well, I had a comment or two questions. One question is: yesterday Mark presented the analysis server with apps, and there was this generally negative response, and now I'm hearing positive ideas about apps. I just want to draw the line.
In some sense, for the apps to run, there has to be a platform, and there has to be data they can access. So is it the same or a different idea you're proposing? Well, I think the idea here is that as long as we can put together a platform, what's key is that you have freedom to put in applications. Any developer can go and answer a question that may be of interest to the users. This is exactly what Mark said yesterday. And I'm just saying, I actually think it's a great idea. But there were other people who said it's intractable, it's impossible, we shouldn't try and do it. No, I think the difference is that there are multiple ways to look at it. One, you can say, okay, we're going to build something that will answer these specific questions; then it's not flexible. Or we say, okay, we're just going to make a platform available, with all the data that you would need to answer questions, ones we already know about and some that arise next year, and let developers build on it. So the apps have to have all the security and all those things. But the question I had for you is: you discussed a world in which the biologists and the clinicians want access to things like variants. But they want answers to questions. You didn't really talk a lot about analysis of phenotypes. How do you think about that? Absolutely. I don't think that to a clinician, variants by themselves are terribly useful, because they say, okay, I see this, they send off for a clinical sequencing test, they find, oh, there is a variant in this gene, and they may know that there's another variant in the same gene, and that causes a phenotype. How do I alter treatment? I completely agree with you. So that's where the app comes in. Those are data integration apps, looking at the data in useful ways, that can provide the clinician with specific information. These applications can be certified; they can be certified to satisfy various requirements to be used in the clinic, and so on.
I think the reaction, David, just to clarify the distinction between Mark's presentation and Gabor's a little bit, to me, was this issue of vetting the applications that come in. And the idea here, correct me if I'm wrong, is that you presented this such that it would be for the average biologist, and there would still be an opportunity for people to develop, let's say, their own applications and download the data, move it locally, and actually do the compute. So that was said yesterday also. But there was this thing about, well, what are the criteria for vetting an application? No, he just didn't talk about it, but it would have to happen, because otherwise you have the same problems everyone raised. He just didn't happen to talk about it. One word that was used, and Mark, I think you mentioned it, was scalability. So certain types of applications, discovering certain types of structural variation, are very computationally expensive. And if a decision were made that this is too much of an investment to have an app that would run on the central server, you would still want the option to take that data and do the analysis locally, as opposed to having things rejected in vetting because they're not scalable. No, no. First of all, yesterday it was always said that, of course, as the baseline, the data is always available for anyone who wants to do it themselves. We're trying to enable the people who aren't able to do that. But second of all, Gabor didn't talk about the fact, and no one asked, well, if you had that app on the machine, but it was computationally so expensive, who's gonna pay for the server? Who's gonna ensure the security? The two discussions just differed in their focus on what it would take to achieve it, but the idea is exactly the same. Yeah, I think, at least from my perspective, there's no disagreement.
It's important to have some type of central analysis, as long as you don't preclude what might be considered more idiosyncratic or more involved analyses. The inviolable thing is that people can get the data and analyze it. It's just that the community of people who can actually do that is small. The real problem is how we help everybody else, because we're already going to streamline how people get data. And there are ways to do the more heavy-lifting style things. So it's possible that that's where centralized servers, say for variant calling and mapping of other people's data, would come into play, and that would not be an app. Would you also introduce this idea of competition that would go on among the apps? So if you had multiple apps doing the same thing, they would evolve, because whatever people used the most would ultimately win out. This competition, as opposed to some group of individuals saying this is the standard that we're gonna apply. That's more than evolving; it has to be dynamic. That's what was said yesterday. You want that. That's exactly why you want to have the apps, because you need diversity. I think the only question is, in an environment where apps could potentially access data that we need to protect, you have to have some quality control or some technical mechanism in place so those apps don't just open a connection to a server and upload all the data. You have to have some vetting, but it could be very fine-grained: this is scalable, this is not scalable, this requires special infrastructure to run. We just have to articulate what the tools require and what that vetting, what the barrier, would be for running apps, say, for free. Yeah, I think the discussion of a sandbox, where people would actually do that and lots of the kinks would be worked out, seems like a really good one.
But the opportunity here is that you can deal with data access at the level of the developer, because you can control what the software accesses, so you don't deal with it on the user end. So the user, through this app, can't access any more than what the app allows the user to access. But that's the vetting question. At the base layer, we can have the environment control which data sets a particular developer can access, but what the applications put out the other side, that's gotta be vetted if it's public. Otherwise, the application has to run under the security framework as well. I think that's what happens security-wise with iTunes and its applications: you can't just put an application on iTunes. Right, but there's been a lot of experience that iTunes doesn't do a good job of that, right? There have been multiple cases this year where Apple's had to pull apps after users pointed out that the app was actually inappropriate. So I don't think we can keep using Apple as the model. No, absolutely, but what we can say is that for the most part it works very well. I mean, out of thousands of applications. But from a security perspective, I don't know that it's the best. The one thing I was gonna say: several people mentioned clinicians using these apps, and I think there's a whole different clinical infrastructure for decision tools and how they have to be validated. So I'm not sure that's the best model. I would get back to David's question again: there are thousands of researchers who work on these individual genes who have very important questions that we need to make sure they can answer. And many times they see that there's a variant, and they have no way of knowing whether that variant has any evidence of being biologically important or not. David.
This discussion really points out to me the importance of being very active in recruiting completely unrestricted data as part of this exercise. If we completely ignore the category of data like George Church's PGP data, then there's nothing in an unrestricted environment that the app developers can work on apart from this vetting process. And if we don't have that, we're just gonna have this very draconian vetting process, and it'll be very difficult to encourage a creative community to get involved. So you have to have these things thoroughly developed in this free environment, because that's where the geeks work. That's where they're most active, and where the most beautiful results come from. And then you take the best of breed and go through an additional vetting process. But if you don't have that initial ferment of activity in the open environment, you'll never get there. Yeah, I'm going to add to what David is saying. I think there are really two technical issues that need to be set up at the beginning. One is access of the apps to restricted data. I think it is much better, as David said, to start out by targeting the apps either at a completely open access tier or at a commons tier, such that the developers and users of the apps are certified researchers. That removes much of the burden of detailed vetting of these apps, which I think everybody feels a little bit uncomfortable with at the scale we're potentially talking about. The second is the issue of the scalability of the apps: how much resources they take up, how much storage they use, how much compute they use, and who's going to pay for that. And I think that right at the beginning, we should talk about ways that the cost of the resources used by the apps is borne, is charged back in some way, to the users of the apps, so that there's built-in supply, demand, and cost in the model.
Otherwise, it's going to become a tragedy of the commons: there'll be a thousand apps and none of them will be usable, because they're all chewing up too many resources. Yeah, actually, resourcing is something that we really didn't talk about, but it's part of a bigger discussion about how to use the cloud. Another advantage of that sort of environment is easy metering: an easy way to meter use, pay for use, bill for use, and so on. So with the cloud, NIH can just fund the analysis, fund the server, and doesn't have to set up many, many parallel computing environments in different laboratories, and so on. Same thing with the apps. You pay for the apps, $4.99. They use advertising also as part of their revenue, so maybe that won't be the case here; some other funds will have to pay for it. You want to? I just want to say, and I know this is not being suggested, but just to make it explicit: I don't think we should put all our apps into one app-system basket, nor should we put only app systems on the table; these solutions are not mutually incompatible. One can have direct downloads; one can have maybe one or two different ways of doing more centralized analysis. Maybe some of the big institutes, in Michigan for example, will effectively run private in-house systems that they use just for themselves. So all of those are going to be different. And I think there are a lot of details within the details here, for example the I/O between your storage and the app environment. And when we talk about apps, what is it really: a Google-like API with very restricted access to the data, or is it more like you've got a file mount? Both ways? And yeah, there are lots and lots of details. This meeting is not the place to hash out those details. The place to hash out those details is in proposals that say we should do it this way, and here's my app.
But I wanna say that there's probably quite a lot of implementation risk in this, and that's why we shouldn't put all the eggs in one basket for this process. Yeah, I just find it kind of remarkable how everyone seems to wanna hear one thing and then object because it's not all things. Like, there should be publicly available data that can be operated on, but there's a lot of data that's valuable that's not publicly available. Users who are very sophisticated should be able to get all the data so that they can innovate and be serendipitous. But the vast majority of people don't actually wanna do that; they want somebody to actually provide them some answers. We should do all of the above. Apps are good, but it'll be a little complicated because we'll have to pay for them and vet them. Somehow, instead of seeing the diversity in these things and saying we want diversity, we're saying you do this, you don't do that. But no one's saying do one thing and not the other. Yeah, different users will be served differently with different applications. Steve, you have been waiting for a long time. Yeah, so I wanna put in a plea for something that hasn't been discussed yet, which is a comment on reproducible research. The NCBI BLAST server has been one of the greatest boons for democratizing access to computational analyses of data. In our hands, a very large fraction of the biomedical literature belongs in the Journal of Irreproducible Results, in that you cannot actually get back the same result that someone got on the day they did their query. Heraclitus of Ephesus said you cannot step twice into the same river, because the waters are always flowing past you. We have the problem that we can't twice query the same database, because the data are flooding over us so rapidly. And so we need some sort of solution.
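[Editor's note] One minimal solution in the spirit of this comment: stamp every result with a fingerprint of the exact dataset release and query, so the analysis can be re-run and verified later even as the database keeps growing. This is a sketch under assumptions, not an existing standard; the dataset and query strings below are illustrative:

```python
# Sketch of a reproducibility aid: a stable digest identifying exactly
# which frozen data release was queried, and how. Re-running the same
# query on the same release reproduces the digest; a new release yields
# a different one, flagging that the "river" has changed.
import hashlib
import json

def fingerprint(dataset, release, query):
    """Deterministic digest over (dataset, release, query)."""
    payload = json.dumps(
        {"dataset": dataset, "release": release, "query": query},
        sort_keys=True,  # canonical ordering so the digest is stable
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

fp1 = fingerprint("1000genomes", "phase1.v3", "slice 20:60000-70000")
fp2 = fingerprint("1000genomes", "phase1.v3", "slice 20:60000-70000")
fp3 = fingerprint("1000genomes", "phase2.v1", "slice 20:60000-70000")
# fp1 == fp2 (same release, same query); fp3 differs (new release).
```

The prerequisite, of course, is that releases are frozen and versioned; the fingerprint only certifies reproducibility if the release it names can still be retrieved.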
So when someone does some sort of analysis, can you figure out what they actually did, and be able to verify that at some later date? And I think that if we're building a platform now to enable these types of analyses for a larger biomedical community, we should be thinking about this from the get-go. So I agree with Steve. I wanna come back to David's point. I think that everybody is interested in everything, and we just need to come up with a plan that sets out what the most important things are. What's the most important thing we can do first? Obviously to get simplified access, right? Then what's the next thing, and the next thing? I mean, I think that everyone's interested in all aspects of everything that's on the table; it's just that some are harder to implement and will take more time. So if we could come up with, at the discussion time of the workshop or sometime, a priority list, and how difficult or easy we think each of these steps will be, and which is a no-brainer we gotta do ASAP, that would be a good outcome of this workshop. I mean, I would say that that's definitely how we should move forward. But given the fact that the largest part of the biology community is underserved by the data, it would seem to me that we shouldn't prioritize only the easy things first. We should give weight to impact. How about cost? I have a feeling that cost is gonna weigh into this, and the things that cost a lot of money will take the longest to implement. Fine, but as long as NIH is spending a huge amount of money generating data and a teeny amount of money making it useful, we could question whether that's a good use of the national investment. No, and I agree with you on that, David. And there are lots of ways to make it useful, but I think there are very simple ways to make the data more useful for the greater good.
And I think this will also evolve over time, and trying to put forward a plan that has everything from start to finish at this stage is kind of a fool's errand. I think what we should be thinking about is the first two or three steps that we all can agree upon, as opposed to saying it has to go this way, this is the best plan. This will evolve, and this won't be the only meeting, I'm sure, on this question. Well, and also maybe deciding there should be additional workshops that talk about certain of these things and what it would really take to do some of them. I guess there are some clear things that we could do now, but some of us may decide that it makes sense to work on building larger-scale infrastructure that'll solve not today's problems, but the problems of six months or a year from now. I think as long as we have diversity, and have those plans put forward and supported at the right scales for different prototypes or projects, that seems to be a great outcome. As long as we say, you know, these are the three or four things that we definitely wanna explore and support, and some of those can be easy and some of them can be harder, it seems like that would be what we would want. We would be very enthusiastic about a specific area individually, and we'd love to contribute to that, but that shouldn't preclude people from, you know, releasing data like dbGaP is talking about doing, which is also hugely valuable and can be done quickly. Yeah, so that brings up the question of what to pilot, how to move forward, how much testing we do up front so we can learn for the bigger systems. And I know Mike wants to talk about that in the wrap-up session. Thanks to all the speakers for this morning. Lunch starts now, yes, thank you. Thank you.