that sufficient information exists with which to understand, evaluate, and build upon a prior work if a third party can replicate the results without any additional information from the author. So what would be sufficient information? Earlier we saw the pyramid, and this is another view of it that may be easier to see with these colors. Usually what we make public to the world is only the part in red: the research products, the publications, presentations, and visualizations of the data. But everything else, what I am marking in yellow and blue, is what is actually necessary to replicate the results and verify the claims made in the publications: the full data set together with its context, the documentation, the conditions, the environment, the notes taken about the experiment, and so on.

As was said before, if we share only some of the information, the data, or some part of the scholarly work, but not the entire data set and the documentation that goes with it, a third party might still be able to replicate the work, but only by contacting the author and getting the missing information. Of course, as time goes by, and think of 50 years from now, the author may no longer be around, and even 10 years from now they will probably not remember the details. And even when all the information is shared, replication is difficult. So, reiterating the point made earlier, the more documentation we provide around the data set, the better the chance that the results can be verified.

What can happen when we don't share? There was an unfortunate case, which many of you know, published in the New York Times about a year ago, of a psychologist in the Netherlands who had falsified his data. In this case it was fortunately discovered, and as part of the follow-up several studies were done. One was a survey of about 2,000 American psychologists, at least 70% of whom acknowledged that they had not shared all of their data, or had removed parts of it, so that the results they reported would look consistent. There was also an analysis of 49 studies which found that the scientists who were more reluctant to share their data were also more likely to have reported results that did not correspond to the actual data. That article concluded: we have the technology, it is time to use it. I am not sure exactly what technology they were referring to, but I will talk about one platform that helps to solve this problem.

Another article, on the other hand, shows that sharing data can lead to more citations of your study. This paper in PLOS looked at about 85 publications of clinical trials and found that the publications that shared their data were cited about 70% more frequently than the others. So let's talk now about the Dataverse Network.
The Dataverse Network is an open source application for sharing research data. It takes care of long-term preservation and good archival practices, but at the same time it provides recognition for the researcher and gives credit for the data. How does it do that? First, before I go into the details: at Harvard we host a Dataverse Network for social science. It holds more than 700,000 data files, and any researcher in the world can deposit social science data there. Starting just this year we also have a Dataverse Network for astronomy data. I believe the challenges we have been finding and learning from with our platform are similar in other fields: what incentives there are to share, what we define as the unit of data citation, what counts as sufficient metadata and sufficient documentation, how we deal with formats and software that become obsolete over time, how we make the data discoverable, and the fact that the amounts and sizes of data keep increasing.

In terms of incentives, we provide a platform that lets you create a data archive, a virtual data archive, for an individual researcher or a research project, which can be branded and embedded in your own website. You manage the data yourself, but the actual data sets are stored and backed up in a central repository.

To give credit to the researchers, the Dataverse automatically generates a data citation. That citation includes the author information, the year it was deposited or published, and the title, but more importantly it includes a persistent URL based on a handle. Handles are similar to DOIs; in fact, DOIs are built on handles. It also produces a numerical fingerprint, a checksum or hash based on the content of the data set, which is independent of the format. When we recognize the data file format, we convert the file into a canonical, standardized archival format, apply the checksum to that, and use the result as the numerical fingerprint, so that over time we can verify that the values of the data have not changed (a small sketch of this idea follows below). That data citation standard was published by Micah Altman and Gary King in 2007, and we have been using it since then.

So how does it work? The application is a centralized repository that you can host at your own institution, or, as we do at Harvard, host for social science data from all over the world. Within that repository you can create as many dataverses as you want, and each one is its own virtual archive, managed by the data owner or the research group, with its own branding. Within a dataverse you have studies, or collections of studies. Studies are self-curated: you enter metadata that describes the data set and upload the data files, plus any documentation files, scripts, and all the other complementary files that form the context of the data set. It also supports data versioning: when you create a study, enter the metadata, and upload the data and complementary files, it then goes through a review process.
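To make the idea of a format-independent numerical fingerprint concrete, here is a minimal sketch in Python. This is not the actual UNF algorithm used by the Dataverse, which follows a published specification; it only illustrates the principle that values are normalized to a canonical text representation before hashing, so the same data produce the same fingerprint regardless of the file format they arrived in.

```python
import hashlib

def format_independent_fingerprint(columns, significant_digits=7):
    """Illustrative sketch (not the real UNF algorithm): normalize each value
    to a canonical text form, then hash, so the fingerprint depends only on
    the data values and not on the original file format."""
    digest = hashlib.sha256()
    for name in sorted(columns):                      # column order must not matter
        for value in columns[name]:
            if isinstance(value, float):
                canonical = f"{value:.{significant_digits}e}"  # fixed precision
            else:
                canonical = str(value)
            digest.update(canonical.encode("utf-8") + b"\n")
    return digest.hexdigest()[:22]                    # short printable fingerprint

# The same values loaded from SPSS, Stata, or CSV would yield the same string:
print(format_independent_fingerprint({"age": [23, 31, 47], "score": [0.5, 1.25, 2.0]}))
```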
We provide different roles: an administrator of the dataverse, who may be the data owner, plus contributors and curators. A contributor might create a study, and a curator would then review it; there is a lot of configuration available, so you can enable or disable this depending on what you want. Then you publish a version: you publish your data page with the data sets, and when you publish that version you get a data citation that includes the version number. If you later edit the study, do additional analysis, or clean up the data set, you update the study, have it reviewed again, and publish a new version. Now you have a new data citation with the same handle; if the contents of the data file have changed, it will have a new UNF, because the numerical fingerprint changes whenever the data change (see the sketch after this paragraph). But you can still reference the old version, and that is obviously very important for good archival practices, because, as was mentioned before, when you reference data in a publication you want to be sure it will always remain accessible. Sharing data is not just about sending data from one researcher to another; it is about making sure that once it has been referenced, it will be accessible in perpetuity. So we keep all the versions: you can always go back to the original data set that the publication, the results, or the claims were based on, while still being able to create new versions of that data set.

We are also working with publication repositories and journal systems to link data with publications. In this case, the astrophysics journal repository service now shows a link next to a publication when it has archived data; when you click it, you find all the data sets of the studies used in that publication, and it takes you to the Dataverse page for that study with all the files.
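As a rough illustration of how such a citation could be assembled, here is a short Python sketch. The field names and the exact formatting are hypothetical, not taken from the Dataverse code; it only shows the idea, in the spirit of the Altman and King citation standard, that the handle stays fixed across versions while the UNF and the version number change when the data change.

```python
from dataclasses import dataclass

@dataclass
class StudyVersion:
    # Hypothetical fields standing in for what a repository stores per version
    authors: str
    year: int
    title: str
    handle: str        # persistent identifier, e.g. "hdl:1902.1/12345"
    unf: str           # numerical fingerprint of this version's data
    version: int

def citation(v: StudyVersion) -> str:
    """Assemble a citation string: the handle is shared by all versions,
    while the UNF and version number identify the exact data used."""
    return (f"{v.authors}, {v.year}, \"{v.title}\", "
            f"http://hdl.handle.net/{v.handle.removeprefix('hdl:')}, "
            f"UNF:{v.unf}, V{v.version}")

v1 = StudyVersion("Doe, J.", 2012, "Example Survey", "hdl:1902.1/12345", "5:abc...", 1)
v2 = StudyVersion("Doe, J.", 2012, "Example Survey", "hdl:1902.1/12345", "5:def...", 2)
print(citation(v1))
print(citation(v2))   # same handle, new UNF and version after the data were edited
```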
Let me also mention that, in addition to working with the astrophysics group, we are working with the Open Journal Systems (OJS) to integrate any journal that runs on OJS with the Dataverse Network. So, as you were asking before, when an author submits a paper, instead of only attaching the data as supplementary files, they will be able to deposit the data directly into a Dataverse Network, a central repository that is independent of the journal system.

The centralized repository, the application that hosts all the dataverses, also provides the functionality to make sure the data are duplicated in multiple locations, using LOCKSS, Lots Of Copies Keep Stuff Safe. It automatically reformats data files it recognizes into an archival format and extracts the metadata from those files, so that metadata is indexed and becomes searchable together with all the descriptive metadata the author enters. All the metadata is stored in a relational database, but it is also exported into XML files following standards such as Dublin Core and the Data Documentation Initiative (DDI) schema, and the system supports OAI-PMH, so metadata can be harvested from one Dataverse Network to another, or between a Dataverse and any other system that supports OAI-PMH, making it searchable across systems. It also supports web services and APIs. For those interested in the architecture: it uses Java EE 6 with a relational database, it is all open source, the files are stored on a file system, and we use R, through Rserve, for the additional data analysis and the processing of the data as it gets ingested. And that's it for my talk; I am open to questions and suggestions.

Question: what needed to be changed when going from dataverses specialized for social science information to something like astrophysics and other scientific information?

Yes, so, as we did with astronomy, you could use it as it is, just as file storage plus the metadata. Some of the metadata is a very high-level description of a data set, so it applies to any field, and then there are additional custom fields that can be configured for your community if you want to add value, as we have done for social science. There, for example, we take SPSS or Stata files, or files from other statistical packages commonly used in social science, and convert them into other file formats; the same could be done for the file formats more commonly used for imaging, extracting information from the header, for example, and providing it automatically as searchable terms. That is what we are doing for astronomy: taking FITS files, which are the most common data files used in astronomy, and ingesting them the same way we do social science data sets. But they are also using it as it is now, simply as a way to group data files that are part of the same publication or the same claim.
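To illustrate the kind of header extraction just described, here is a minimal Python sketch using the astropy library. The list of keywords and the file name are only examples, and this is not the Dataverse's actual ingest code; it simply shows how selected FITS header fields could be pulled out and indexed as searchable metadata alongside what the depositor enters by hand.

```python
from astropy.io import fits   # assumes astropy is installed

def searchable_fits_metadata(path):
    """Sketch: read the primary header of a FITS file and return a small
    dictionary of keywords that could be indexed for search."""
    wanted = ["TELESCOP", "INSTRUME", "OBJECT", "DATE-OBS", "EXPTIME"]  # illustrative keywords
    with fits.open(path) as hdul:
        header = hdul[0].header
        return {key: header[key] for key in wanted if key in header}

# Example with a hypothetical file name:
# print(searchable_fits_metadata("observation_001.fits"))
```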
Question: with this centralized repository model, how well do you think it will scale to the kind of large-scale image data that many of us deal with? We are very interested in having a centralized architecture, at least for annotation, for the work that we do, but it is not really practical for us to host petabytes of data; there is no way we would get the funding to host petabytes of data centrally.

Well, the practicality of a centralized model is going to depend very much on the size of the data sets. With social science data sets the limit is basically two gigabytes per file, although within a study you can have as many files as you want. We are looking into larger data sets, partly for astronomy, but also because we are talking with other groups at Harvard about them. I think part of the architecture would have to change, but the central repository, or a federated system with multiple repositories, which the architecture of the Dataverse Network supports, would still work. Maybe not through an HTTP upload, but we are looking at ways to keep the larger data sets in the central repository and still be able to retrieve them, for instance at lower resolution.

Follow-up: I guess, within what we are doing, the only way that could work is if there were some big centralized agreement to fund a central repository, which does not seem very likely at the moment.

Comment: there is an elephant not in the room, and I wonder if we should ask what we could learn from it. I don't think anyone from Google is attending, and perhaps there is no one from Nature here today, although there was yesterday. We have been talking mostly about how to make things work and be accessible in an open sense, and the constraints that Google has and the constraints that Nature has are slightly different; the resources may be a little different as well. I am new to INCF and don't know where it hurts, whether in scalability or in transformation between different types of data sets and so forth, but is it a reasonable, or simply a naive, question to ask what we could learn from the contexts of other data availability systems? Maybe that is a more general question. Well, let's have the panel take a shot at it.