Thanks, Lita. We now want to have two reports on recent meetings. Eric referred to these in the Director's Report. The first one is on the Next Generation Sequence Data Management Workshop, and Peter's going to present that.

So as Eric alluded to, I'm going to be talking about a workshop we held a couple of weeks ago. I've shorthanded it down to the Next Generation Sequence Data Management Workshop; we had a more elaborate name that Eric referred to. It was held May 4th through 5th, and I'm sort of the sacrificial lamb of a very large project team that helped coordinate and run it. I start with quotes to ponder. Adam sent me the first one, from Oscar Wilde: "It is a very sad thing that nowadays there is so little useless information." I think this is a hallmark of this meeting. The other one, from some of my research, is a quote from Michael Waterman talking about the history of databases in 1999: "We have found it difficult to cope with today's volume of data, but the problems of today will look elementary in just a few years."

So here's the problem. The problem is cheap sequence. Sequence keeps getting cheaper and cheaper. If you look at Moore's law, which is just a measure of the cost of processors, the ability to put transistors on a chip, it's not coming down as fast as the cost of sequencing, and if you layer on essentially the cost of disk storage, it tracks Moore's law pretty closely. So the problem is that we're generating a lot more sequence than, at least cost-wise, we can stably archive. And the precipitating issue, Eric alluded to this, was that, I think in February, NCBI publicly announced the closure of the Sequence Read Archive due to budgetary constraints. I don't want to say TCGA can take the blame for it, but I think it was a substantial contributor, and I think the ramp-up of TCGA sort of put them over the edge. In the process of ongoing negotiations, or interactions, of the resources board with NCBI, additional funds were provided to ramp down the SRA activities. And at the same time NCBI was ramping these down, other archives were still accepting data. So the European Nucleotide Archive and the DDBJ archive in Japan are still accepting data, although they announced that they're anticipating problems.

So we pulled together a workshop on fairly short notice, as Eric alluded to. It involved multiple NIH institutes and was a day-and-a-half workshop. What we tried to do was address the scientific issues around archiving and managing next generation sequencing data. We didn't really want to discuss, you know, gee, the SRA is going away, how do we fix it? The invitees represented a range of researchers from different scientific communities, sequencing centers, and computational biologists. We also included journal editors, funders, and representatives of the major sequencing repositories, to try to get everyone represented. And in addition to these classes, we tried to get representatives from people doing big science and people doing smaller science, so the bigger centers and the smaller centers. As co-chairs, we recruited Owen White and Gabor Marth. Owen is the PI of the HMP Data Analysis and Coordination Center, and Gabor is one of the analysis leads for the 1000 Genomes Project. So the agenda: it was realistically set up as two days with two different goals. The first day was set up to discuss, you know, what are the different needs of the scientific communities for different levels of this data?
Going in, there was a lot of talk, you know, do we need to throw away reads? Do we need to not even store the data? We wanted to try to address that. We also layered on a constraint we need to consider, the need for human-subjects protections for any human resequencing, because it's potentially identifiable. The second day addressed the infrastructure: what are the business models you can come up with to support the submission, archiving, and dissemination of this data?

The goals we went in with, we sort of broke down by time frame. Short term, we wanted to look for potential approaches to alleviate the burden. I alluded to the idea that maybe we don't really need to store this, and that would make people much happier; we came to that conclusion, I think, for the image data from some of the sequencing machines, and sequencing centers no longer store imaging data. Maybe they don't need to store reads. In the medium term, we wanted recommendations on business models for the future of this data management problem. What's the best way to arrange this? Do we just go with a centralized model? Do we distribute this? Are there ways of funding this that are better than what we've come up with right now, which is to hope that NCBI will pick it up, which is really an untenable long-term solution. And then for the long term, we need to think about whether there is additional research that we need to support to solve the hard problems.

We started the workshop with several speakers framing the issues. We talked about the fact that the DNA sequencing community has a long history of providing access to sequence data. It goes back, actually, to when sequencing became commonplace; there were requirements for submission of sequence data to the archives as a condition of publishing an article. And then when the Human Genome Project started ramping up, there's a history of rapid pre-publication data release, which in the strategic plan that was published in February is now called part of the essence of genomics, and which has been discussed at the various data release workshops at Bermuda, Fort Lauderdale, and Toronto. The journal editors' viewpoint was that they require data to be made freely available upon publication because they feel this enables reproducible research, and they are fairly strict about it; the journal editor Hilary Sussman from Genome Research said they make exceptions, but these are negotiated and they don't like to do it. Finally, we ended with Ewan Birney talking about the issues that EBI faces with this and how they look at it. He talked about the fact that the curves show sequence generation outpacing the ability to store and analyze the data. Essentially, those curves of the amount of data we're generating, layered onto the amount of storage we can buy at a fixed cost, mean that as you generate more data, you're going to have to buy almost exponentially more disks to store it. And so he talks about, and I think this is good, the need to deflect the curve: you need to make it so that the storage of the data becomes manageable under the current budgets for buying disks, et cetera.
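To make the curve argument concrete, here is a rough back-of-the-envelope sketch in Python. The doubling times are illustrative assumptions rather than figures from the workshop; the point is only that when sequence per dollar doubles faster than disk capacity per dollar, the gap compounds exponentially, which is why the curve has to be deflected rather than outspent.

```python
# Illustrative arithmetic only: the doubling times below are assumptions for the
# sake of the sketch, not figures presented at the workshop.

def fold_growth(years, doubling_time_years):
    """How many-fold a quantity grows in `years` if it doubles every `doubling_time_years`."""
    return 2 ** (years / doubling_time_years)

years = 5
seq_per_dollar = fold_growth(years, 0.75)   # assume sequence per dollar doubles every ~9 months
disk_per_dollar = fold_growth(years, 2.0)   # assume disk capacity per dollar doubles every ~24 months

# At a flat archiving budget, the share of newly generated sequence you can afford
# to keep shrinks by roughly this factor over the period:
gap = seq_per_dollar / disk_per_dollar
print(f"After {years} years the data volume outruns a fixed storage budget ~{gap:.0f}-fold")
```

Under these made-up rates the mismatch is already more than an order of magnitude after five years, which is the shape of the problem the workshop was reacting to.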
The second thing he emphasized, and I think this was a hallmark of the meeting, is that genome projects, and here I'm talking about the interactions of genome projects with repositories and data management centers, represent an ecosystem. There are interactions between the data producers, the archives, and the data users, and it's these interactions that often cost quite a bit. I asked to borrow one of his slides to show this, and here he's talking about the costs involved in being a data archive. You can talk about volume costs, the cost of putting data on a disk and of moving data in and out of that disk, but there are also submission pipelines and output pipelines. And the thing about submission and output pipelines is that they scale with the diversity and number of submitters: the more submitters you have, the more this scales, roughly linearly. You've got to have people to deal with this, and it becomes a fairly expensive proposition to fund these types of things. In terms of the volume costs, if you can improve the technology, you scale with the technology. And so it's sort of like, either you can do it or you can't. If you can do it, let's store the data. If you can't, you've got to come up with ways of throwing it away, or at least compressing it.

So the next session focused on what data the communities want. Prior to the meeting, we engaged co-chairs from different scientific communities and asked them to prepare a white paper on issues around storing data. We prepared them with questions: What data would you throw away? What data do you find most important? Would you pay for the data if they were in a pay-per-use type situation? The five communities that we selected were de novo sequencing, that is, sequencing new genomes; metagenomics; resequencing, and typically this is resequencing humans, although certainly the same issues occur when resequencing lots of individuals of any species; cancer genomics; and then functional genomics. They presented, we then had talks on the idea of compressing data and what you can do with it, and on storing data, and then they had a breakout session and came back with the following observations. One, these communities want the data electronically. Some people would go in and say, you know, the joke at the meeting was that you can consider sequencing cost to be zero, and if sequencing cost is zero, then you don't need to store the data, you just store the samples and you can regenerate the data. And uniformly they said, no, you just can't do that: there's too much variability in the samples, and samples often are very precious. The other thing is that, given the differences in data complexity between the communities, there is going to be no single solution that works for every community, or even for a big center as opposed to a small center. However, they could come to a conclusion: most communities want a minimum of reads. So you've got to keep the reads. And one proposal that was floated as being compatible with most communities is that you have two data types. You have reads aligned to a reference, and in this case the standard format for human resequencing and cancer genomics is BAM, although there was a lot of talk about other compression formats, one called CRAM; it's got a nice acronym. And then you need a single level of derived data; again, for human resequencing, the example was BCF. Other projects, functional genomics for instance,
might want a bigWig-type file or, to give another example, a peak file. The other conclusion from...

Sorry, can I make a comment first? So Mike and Jill and Rick, Richard Wilson, Rick Wilson, were at the meeting. Oh, sorry. So I would amend that "most communities want a minimum of reads." What I heard there was that most people felt that reads should be kept for a minimal time period. Not that there was a small number of reads that should be kept, but that the reads should be kept for some minimal period of time, specifically while the technology of processing data from all these new platforms reaches some kind of stable state, which it is not at now.

Yes, that's a very good point, and I forgot to raise it. The reason they want the reads is that they felt the technologies weren't stable. They weren't robust enough to say that we'll throw stuff away because we've extracted all the information we want. So there is still a lot of technology development that needs to be put forward in terms of extracting information from the short read data.

So metadata was considered to be very important, and it was looked at as being underdeveloped, or at least not pursued actively right now. It's a very small fraction of the data size; unfortunately, it's complex and it's domain dependent. So the conclusion was that there needs to be an effort to pursue standardized metadata. In a sense, one of the reasons for doing this is that at least some subsets of metadata basically track the data provenance: you can find out what transformations happened to the data, and this is very important, again, for the concept of reproducibility. How to do this is hard. Even some of the people at the meeting who are involved in data standards said you don't want a large committee to make the decision, and you want to streamline it so that you're not overreaching in the amount of metadata you're collecting. Ideally, you should pursue this across projects if possible; a VCF file format created for cancer genomics should work for human resequencing. And so, again, the conclusion from this, and an action item you'll see later, is that there should be an immediate follow-up to develop working groups to do this, but to try to avoid data standards bodies that may drag this out.

So we talked about how to reduce data storage, and there were many options. I think the one that was very enthusiastically looked at was to come up with better ways of compressing the data, and the one that was talked about was reference-based compression. I don't remember the numbers, but EBI has been talking about this, and they think that using reference-based compression, if they can get it into production mode, will make it much easier for them to deal with this, at least for the coming, I want to say five years, but at least two or three years. They also recommended prioritizing projects for long-term archiving; maybe you don't want to dump all the reads into the database right away, you want to wait until the project is close to publication. They might limit the maintained lifetime of archived data, so once a data set has been superseded by another data set, or the data quality is such that newer technologies are better, you can consider reducing it. And then, just in general, there was almost a plea to reduce the redundancy of intermediate data sets. There are a lot of centers that have multiple copies of very large data sets lying around their computers.
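As an aside on the reference-based compression idea mentioned above, the sketch below shows the core trick in a deliberately minimal form: for a read that aligns to a reference, store only its position and its differences from the reference rather than the full sequence. The reference string, record layout, and function names are invented for illustration; real formats such as CRAM handle far more, including indels, unmapped reads, and quality scores.

```python
# Minimal sketch of reference-based compression: store an aligned read as a
# position plus its differences from the reference, instead of the full sequence.
# The reference, record layout, and function names are invented for illustration.

REFERENCE = "ACGTACGTACGTTTGACGGA"

def compress_read(read, pos):
    """Encode a read aligned at `pos` as its length plus (offset, base) mismatches."""
    diffs = [(i, base) for i, base in enumerate(read) if REFERENCE[pos + i] != base]
    return {"pos": pos, "len": len(read), "diffs": diffs}

def decompress_read(record):
    """Rebuild the original read from the reference plus the stored differences."""
    bases = list(REFERENCE[record["pos"]:record["pos"] + record["len"]])
    for offset, base in record["diffs"]:
        bases[offset] = base
    return "".join(bases)

read = "ACGTACCTAC"                # differs from the reference at one position
record = compress_read(read, 0)
assert decompress_read(record) == read
print(record)                      # {'pos': 0, 'len': 10, 'diffs': [(6, 'C')]}
```

The appeal is that reads which match the reference well shrink to almost nothing, which is why the approach was seen as a way to buy a few more years under fixed storage budgets.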
When these data sets are very large, you have to try to eliminate that redundancy. So the action item, which was probably addressed to NIH, is that we need solid proposals to get us through the next two years, and the mix might be different for different communities. Different communities might need different things, and the funders need to consider this.

We talked about data storage and service options. NCBI talked about their view of centralized data storage and distribution, and they presented a very nice model where this can be scaled for different solutions. In other words, if you just wanted to store the raw data without any metadata associated, they have an infrastructure to do that; you can add more metadata and the costs go up, but it depends on what a project would want. In terms of distributed systems, there was a lot of discussion about the cloud, whether the cloud was, you know, the thing we should be doing either now or in the future. My sense was that commercial clouds were not seen as viable, largely because of cost, but also because the programs people use haven't been sufficiently engineered to take advantage of the cloud infrastructure. However, there was a lot of enthusiasm for the advantages of bringing the data and the compute together, at least for the smaller labs, small and medium sized labs. And so the recommendation was to pursue this more in incubators, in a test setting.

I should point out that we did talk about data security and human subjects. There was a presentation by Brad Ozenberger about how the trusted partner policy may allow for distribution of controlled-access data by groups other than NIH and NCBI, which would allow more of a distributed system; it would allow people to distribute controlled-access data without going through the core NCBI infrastructure. I think they also felt this was a very important thing that needed to be addressed. It really is limiting to the use of some of these genomic data sets; the process that data access has to go through provides a barrier to use of the data. And they actually wanted it to be addressed by a larger group, one that would bring together more bioethicists, more ELSI researchers, more people who actually deal with data security. In other words, is a cloud really secure for controlled-access data?

The other thing that came out a lot in terms of the data storage and service options is the recognition, and again this goes with the ecosystem model, that there are distinct activities for data submission, data archiving, and data distribution. What came out was the concept of hubs and brokers. You can think of a hub as a place to store sequence and share it with other hubs. There would be multiple large hubs that receive data from submission brokers and then provide the data out to distribution brokers. You could think of these hubs as being specialized or general. If you think about what the current situation looks like, it's clear that NCBI and EBI and DDBJ are very large hubs; they share data between them, and they actually act as brokers also, in that they're responsible for bringing the data in. Although I think there was some feeling that this is not a situation that works; again, it goes to Ewan's earlier slide that the data submission and data retrieval problems are very expensive. The concept of brokers would provide project-specific fittings for submitters and users.
In other words, a broker would provide domain-specific knowledge that would make it easier to take your data and fit it into the hub, which is a significant problem, as anyone who has dealt with a large project with a data coordinating center knows. There's always this tension of getting the data producers to provide enough metadata to the DCCs so that the data is useful. You could think of the brokers as receiving the data from the submitters and providing it to hubs; they may also retrieve data from hubs and act as data distribution brokers if necessary. They also pointed out that in the future we need to think very creatively, because in the future there will be millions of human genomes or exomes or expression profiles, or maybe the equivalent would be trillions of E. coli genomes, and the question will be how we compute against these data sets. Solutions for the next two to three years may not work in the future, and we need to think hard about how to go forward.

We talked about funding, and basically business models for doing this, and I think the conclusion, a conclusion that I think NHGRI had going in, was that the funders are going to have to start thinking about how to pay for archiving, whether up front or more directly. We can no longer have a situation where the archiving is assumed to be done by another institute or another center, in this case NCBI. I think there was also a recognition that informatics activities are underfunded and that we need to make sure the projects have sufficient funding for informatics: data production, analysis of data during production, archiving, and analysis for generating derived data. I think this falls on the funding agencies to recognize it, on the principal investigators to put it into their grants, and on the reviewers to recognize that this is important and not treat it as an item that can be deleted.

Summarizing the action items: I think Owen put forth the idea that, gee, there's no urgency to this problem, and I think there was a lot of pushback; there is an urgency to provide near-term solutions for the wide range of projects. Again, there needs to be a working group on metadata; the concern with the metadata is that if you go to a distributed system, you want to make sure the data is all the same, that people can download it from different places and understand what it is and combine it and integrate it. There needs to be a continued look at compression methods, including ways of efficiently compressing quality scores. I didn't mention this, but there was a conclusion that quality scores are helpful and that many communities wanted quality scores; but when you look at the compression methods, quality scores for the most part didn't compress as well as you would like unless you start throwing quality information away, unless you reduce their granularity. If you reduce the granularity to a per-base call of either good or bad, then you do well, but a lot of people want more granularity. Again, there's the idea of a pilot project to use cloud computing to see how this would work, either funded by NIH or elsewhere. A workshop on human subjects and security issues in sharing large genomic data sets. Exploration of the data hub and broker model. And then thinking about data structures, tools, and databases beyond the two-to-three-year horizon.
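As an aside on the quality-score point in the action items above, here is a toy sketch of what reducing granularity buys you: collapsing Phred-like scores into two bins makes the quality track far more compressible, at the cost of resolution. The scores are randomly generated and the single threshold of 20 is an arbitrary assumption for illustration, not a value discussed at the workshop.

```python
# Toy illustration of the quality-score granularity trade-off: full-resolution
# Phred-like scores compress poorly, while scores binned to "good"/"bad" compress
# very well. The threshold of 20 is an arbitrary choice for illustration.
import random
import zlib

random.seed(0)
quals = [random.randint(2, 40) for _ in range(100_000)]   # fake per-base quality scores

def binned(values, threshold=20):
    """Collapse each score to 1 ('good', >= threshold) or 0 ('bad')."""
    return [1 if v >= threshold else 0 for v in values]

full = zlib.compress(bytes(quals), 9)
two_bin = zlib.compress(bytes(binned(quals)), 9)
print(f"full granularity: {len(full):,} bytes compressed")
print(f"two bins:         {len(two_bin):,} bytes compressed")
```

Running it shows the binned track compressing to a small fraction of the full-granularity track, which is exactly the trade-off between compressibility and granularity that the communities were weighing.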
At the meeting there was an announcement of a change in the status of the SRA. Eric presented the view that was true before the meeting, which was that NCBI had stated the SRA would be shut down around October 1; actually, they always told us 8 to 12 months. So from May onward they would only be accepting registered samples, which is why Eric alluded to the fact that we've been accumulating sort of an inventory of the projects we expect to generate. However, during the meeting Jim Ostell announced a change in SRA status, and this is a direct quote from their website: NCBI has been working with NIH institutes and NIH grantees to develop an approach to continue archiving certain subsets of next generation sequencing data that are used most frequently. So the current status is that NCBI now plans to continue handling sequence data associated with, basically, the functional genomics data, the RNA-seq, ChIP-seq, and epigenomic data submitted to GEO; genomic assemblies that are submitted to GenBank; and then the 16S ribosomal RNA data associated with metagenomics submitted to GenBank. So not all of the SRA will be shutting down. I should point out, and I think Eric pointed this out, that dbGaP will not be affected by the SRA shutdown.

And with that I'll close with my thanks. Again, we pulled this off in a very short time frame with a very good working group. The co-chairs were Gabor Marth and Owen White, and about 75 people came on very short notice. We hope to prepare a written report to follow with these conclusions. And so I guess I'd like to turn it over to the three people who were at the meeting, Mike, Jill, and Rick, to see if there are any comments, anything that I missed or anything that you would like to raise.

So maybe I'll start. I'd say, Peter, you gave a very good summary; I'd just like to reinforce a few points. Certainly the thought of the closing of the SRA did cause an uproar in multiple communities, including the one I'm involved in, in terms of association studies with resequencing. So the meeting was certainly well timed and important, although it's certainly not the first time this issue has come up; I think it was a session that Rex chaired, actually, at the five-year planning meeting last year. There was a constant joke about, you know, the thousand-dollar genome or the one-dollar genome. Well, that's obviously just talking about the experiment; it says nothing about archiving and computing on the data. And so this really isn't a new cost, it's just a new issue; it's just come up more strongly. I think we're all excited about the sequencing costs, and on the two curves it's important to note that the plot of Moore's law and sequencing cost was on an exponential scale, so that wasn't just a factor of difference, that was multiple orders of magnitude. What was not surprising to me, but a little bit scary, just to re-emphasize, is that nobody said we don't want the sequence reads. I think we were all hoping, or at least some of us were hoping, we'd go there and some communities would say we don't really need the raw data, or at least the raw data at the level of the reads. I didn't hear anybody say that. And so while avoiding the storage of multiple intermediate sets of files will certainly save something, and that's probably one good result of all of this discussion, forcing people to think more carefully about that, fundamentally we're still talking about a huge amount of data. Something that was interesting here is that ENA and DDBJ are not going to be the solutions to this.
Several of us thought, well, maybe we can just send our data to one of those repositories, and some of that will happen, but certainly not enough to solve the problem. On the really serious issues, I would certainly agree with the notion that the cost of archiving has to be put into grant proposals. I think there was not total consensus on that point, but I think that's the funding model that was discussed that probably made the most sense. And then one final sort of aside: it was very interesting to hear people talking about metadata and standards, and there were multiple people making the point that yes, this is really important, yes, we do need to do this, but please don't get the people involved in setting standards involved in the discussion of standards, that would make it much too complicated; and the approach to take was more an intersection approach, not a union approach.

Thanks, Mike. So, Peter, I think you did a great job summarizing what went on, and Mike added some good comments. I would add a couple of things. One is that there was a group that felt they didn't need all the reads, and that was the functional genomics group. However, what we did agree on was this issue about short- versus long-term solutions: short term, we can't throw away the reads for RNA-seq data. We're still dealing with the issues of data sets being swamped by abundant messages, how we're going to deal with lowly expressed genes, the stability of the platform, and all those issues, but our feeling was that this was a shorter- rather than a longer-term issue. And there were groups who felt that their interests were more on the data processing approaches and algorithm side, and they wanted those reads around forever. It may be that there's a solution in between, which is that there are some standard data sets, not standards in the sense of the format, but that even when we have a more stable situation, for some data sets we may want to keep those reads around so that people can continue to work on different approaches and things like that. The cloud thing is kind of interesting, because of course NHGRI has funded a couple of these kinds of experiments using the cloud, and it speaks again to this notion, and it's certainly true in our sequencing center and it's probably true at WashU also, that there are copies of big data sets all over the place. For the past six to nine months we've been making a concerted effort to stop people from being able to make copies of these huge data sets, and so that idea of putting the compute where the reads are stored sort of speaks to a cloud-like solution. Whether it's commercial or private or distributed or not, I think we have to think about some more, but you just can't be moving or copying data sets of that size anymore; it's just not a viable solution, and we have to become much more disciplined about that. And I do just want to emphasize one more time that the solution we choose for the next two to three years may not be the solution we need five years from now, even if sequencing, the experiment itself, does go to zero.

Yeah, I don't have much to add. I think both Mike and Jill have added some good comments to an excellent summation from Peter. I think some of the actions that Peter had listed on a previous slide are good; I doubt that they are adequate, because it's a pretty tough problem. And to emphasize another point that you had on one of your last few slides, this really has been underfunded substantially. I was
going to ask Jeff Schloss how long his technology development program has gone on, and compare that to the amount of time and effort and money that's been put into worrying about getting to the point where we could actually handle the analysis of a thousand-dollar genome; it pales by comparison. So I think we're going to talk about this, I think there are some other things in the works, and hopefully between some of these we'll have some idea of where we go with this in the next couple of years, but this is a really tough problem.

Can I just add one more thing, Peter? Because, you know, in the background as this meeting was going on, a number of us were chatting about the fact that we never really got real data about how many times various data sets at NCBI and the SRA have been touched, and by whom, sort of what the distribution of users was, how long those data sets were actively used, and so on. I think any solution has to be predicated on getting data like that. And I think that's something, I agree, that we said we would do and we didn't, so it wasn't your job.

Yeah. Other comments, questions?

So, Peter, I just wanted to follow up on a comment I think you made about people agreeing that storing the samples was not a fix to this problem. It seems to me that as the technology for sequencing gets cheaper and cheaper, and at the same time, I think one can assume, problems of quality will also be addressed, you'll not only have the ability to generate the data more cheaply, you'll also have the ability to resolve issues in existing data sets. It seems to me that at least for some projects, perhaps certain larger projects where samples are particularly valuable and real considered effort has been put into collecting and validating these samples, would it not make sense to address this hand in hand? I think there would be a strong rationale in some cases to predict that 10 years from now it might actually be a really good idea to regenerate the data. Now, granted, you learn things and you want to design new projects, you don't want to keep going back over existing projects, but there may be cases where that makes sense, if nothing else as a risk mitigation strategy.

So I think that we were trying to address sort of the medium term, the two to three years, and I think you're pointing out that yes, we may have to reconsider this issue as, as you say, the data quality gets better on these machines and the cost does approach zero. But the issue that was raised time and time again is that some of the sample collections are just too small and too precious and you just can't resequence. Granted, there are some studies for which the amount of sample that you have will allow you to do something like that, and we all acknowledged that and said that for those cases it may be that resequencing would be a good thing.

But I guess what I'm saying is something different, which is, sure, retrospectively maybe in certain cases you can't, or in general maybe even for certain types of samples you can't do that, although can't is a big word; I mean, the amount of sample one might require a few years down the line could be much less. What I'm suggesting is that, as you say, PIs should build into an application a budget for archiving the data; and perhaps for at least some of these projects, certainly not all, PIs should also build into the
application some budget for collecting enough sample and for storing the samples so that they can be reanalyzed, because if you look at that exponential curve that everybody keeps pointing to, I would argue that that's a very logical reason for doing so.

I don't disagree with you at all, Mark. It's just that in some cases, you're right, one should never say never, but in some cases that's going to be extraordinarily challenging.

I'd like to really endorse Mark's point. I think history is illustrative here. I just recently moved my lab, and one of the things I found were boxes and boxes and boxes of sequencing gel films, and the first thing I did was toss them. So it in many ways is the same issue, and when we thought about the utility of having those primary reads present in the lab, we said it's faster to just re-sequence, and I suspect that's probably a pretty reasonable way to think about it.

So, Brad, just on that point: TCGA is trying to store away a couple of micrograms from every specimen, just in case, so that we can go back 10 years from now and do a better job. But also along these lines, I wanted to speak just a minute about compression. We think it'll be part of the solution; if you can get, say, down to 10 percent of the current sizes, it could really make a difference. But we really don't know what we lose, so, again, the TCGA community wants to do the experiment: take some files, do the compression, and see what's lost in terms of mutation calling pre and post compression. That's got to be done.

So I'd like to just say that I agree with Mark, as long as we say sometimes, and a large differentiation is between what is happening now and in the future versus what is possible with existing studies. I think that was the stronger focus of the meeting, but you're absolutely right, both are very relevant.

I just had a question on the bullet that is the workshop on human subjects and security issues. Is the suggestion there that it's primarily on the security issues around human subjects data, or was it broader than that? I think it's broader than that. I think it's the idea of looking at whether research participants are comfortable with the idea of having data on the cloud, for instance. Even if you can secure it well enough, there's always going to be this fear that it's on the cloud and someone can download it; or, the example used in the workshop was, you have someone put up a program on the cloud to compute on the data and it may accidentally spew the data out to the world. So I think it was a combination of what I always refer to as the ELSI issues, of how do you provide efficient access to this and still keep the research participants' protections in place, and then, layered on top of that, what is the security that is really required to give people the confidence that, gee, it's not going to be like, I forget, what's the last one, the gaming system where they released all of the identities of the users, and I can't remember which gaming system it was. Lisa? Sony PlayStation.

Peter, it's Rick, can I make a comment? Go ahead, Rick. So isn't there data about how secure this is? I mean, there are other groups using the cloud in this way, and I know that the loss of the data from Amazon recently was supposedly very low-level stuff, meaning it's not the highest-security stuff, but surely we can learn something from businesses and others who
use this, to find out whether... I think people are probably over-worrying about it, because I bet there are ways to make it work. And I think that was the idea, that we didn't have the people in the room to address those issues. Okay, it's worth, I think, finding that out.

So, Lisa. Lisa, you need to speak into the microphone. Just keep talking, is that working? Yes. Another major aspect of this was researchers' access to the data. As it is now, with the data access committees, researchers, through their institutions, have to get permission on a data-set-by-data-set basis, and so that becomes a real bottleneck for analysts to get access to data. There's been talk about something like a genome commons or a licensing arrangement for analysts, where analysts can get licensed, perhaps by the university or some organization, so that they can have access to all the data sets and promise to obey all the consent restrictions and so on. So one side of it is streamlining the use of these data, because millions of dollars are going into all these studies and yet it's actually kind of hard to get to use those data.

Ross? Yeah, I have a question about the pilot project to use cloud computing, and I confess that I have a lot to learn about cloud computing. It was actually very off-putting to me, the idea of shipping data off to the cloud, but I hear that's because I don't know what it really is, and that's my question. I do understand that this gives you the possibility of having the data in multiple distributed places, and hence that could provide stability, but I also understand that the current cost is prohibitive in the commercial cloud. So one question is, is there a non-commercial cloud, and how and what exactly would that be? I mean, there's a lot of research on clouds, and I'll give an example with modENCODE: one of the persons at the workshop was Bob Grossman, who's at the University of Chicago, and he has BioNimbus, which is a model for cloud computing for academic researchers; all the modENCODE data is on the BioNimbus cloud, and again, it's his academic cloud.

So just to add to that, there are a couple of what we call public clouds, which are academic ones. There's Owen White and Florian Fricke doing what's called CloVR, there's the one that Bob Grossman has, and there are a couple at NCSA and ANL. The thing about these is that these clouds are being tuned for the kind of data they're needed for, right? So the issue that Peter brought up before was that the current architecture of the commercial clouds, like Amazon, et cetera, is not tuned; in particular the storage component, the way you access the data, and the way the software is set up in that cloud are not appropriate for the way we want to do it. The advantage of them is that they're bigger and they're faster and they have more of everything; the downside is it's not tuned. But there is a lot more interest now in the public sector, those academic versions, and they're increasing in that respect. A couple of other things. I think Rick said something on the phone about security: at our cloud meeting last year we had Amazon's security guy in there, who was an ex super-duper NSA supremo who dealt with cyber threats from God knows where, and basically the security measures that are now implemented in the cloud are pretty much the same as what you would have in your own server farms, so it's just a case of concern about that. In terms of the human subject issues, it's what
Peter said, which is the accidental streaming out of the data, that applies there. And a couple of other things about the cloud. It's right that we have very large datasets and you want to bring the tools and the data together, but it also offers another opportunity for the smaller groups, and that is that those that do not have IT infrastructures can utilize these environments, where basically they can bring some tools and some data together; they don't need to set up large infrastructures themselves, and they can compute on that in a particular environment. It also means that if you have a lot of data from a lot of different groups, like in metagenomics, we can do this; it allows a lot of people to use those pieces of data together. And what came up at the meeting, and those who were at the meeting from council can correct me if I'm wrong, was the idea of whether we should have some sort of pilot at NIH where we have a cloud that deals with some of NIH's policies; how we look at some of the public clouds, like what Owen White was doing with CloVR and what Bob Grossman has; and, if there are commercial vendors involved in this, how we actually work with them at NIH. There are several initiatives going on around that. So I hope that helps you understand some of the pieces we've got.

So unless there are other burning questions about the meeting itself, I'm going to step in and cut the discussion off in the interest of time. There will be opportunity later in the council meeting to continue the discussion on how we move forward.