I'll be talking about some of the changing paradigms for data management, publication, and sharing, which ultimately lead to what I would refer to as open science. In the next few minutes I'll cover several things. First, I'll start with some brief definitions, then talk about some of the benefits of data sharing, which will probably be intuitive to most of you. Then I want to focus on some examples of data sharing in ecology in the United States and provide a brief history of what's happened there over the last few decades. After that I'll talk about some of the challenges and solutions related to open science, and conclude with some best practices for promoting open science and what we have in store for the future.

Without further ado: the definition of data sharing is pretty obvious. This one comes from Wikipedia, but I think we would all agree it's fair: data sharing is basically making data available for use by other investigators. That's the simple definition. A more comprehensive and useful definition today is the one put forth by Open Knowledge's Open Definition Advisory Council in October 2014, which I think is a really good starting point. They define an open work that supports open science as following three key principles. First, an open license, such as a Creative Commons license, is associated with the data product; this includes the freedom to use, build on, modify, and share the product. The second principle refers to accessibility: the data product should ideally be available for download from the internet without any financial charge. And the last, and I think an important one for us to think about, is open format: data should, again ideally, be machine readable, available in bulk, and provided in an open format.
Or at the very least, the data can be processed with some kind of open-source software tool. Again, this promotes an open science, open work environment.

With respect to data sharing, I think we're all probably familiar with a lot of the benefits. The most commonly cited one is that it's for the public good: data are valuable products of the scientific enterprise and should be treated as such. The second is public trust. We've seen a lot of examples in the literature over the last few years focused on things like Climategate and other real challenges to the scientific enterprise involving misuse, misinterpretation, or in some cases fraudulent data. This, again, creates a need to enhance the public's trust in science. A third key benefit, documented in several publications, including some by Heather Piwowar, is the increased credit that scientists get from sharing their data products: if you make your data available as a product, your publication is more likely to be cited than publications that don't make the underlying data available. And lastly, there's one that's been appearing on the international radar screen lately, which is the association with human rights: the sharing of scientific data is considered a human right by the UN and other international bodies. But from my perspective, coming from an environmental science background, the key point is that by sharing data we can more easily and readily tackle some of the grand environmental challenges we face today. That's exemplified by all of these magazine covers from Time, The Economist, Science, and others that focus on challenges like climate change and energy usage that we're faced with now and will be for probably many decades to come.
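The three Open Definition principles lend themselves to a simple, machine-checkable checklist. The sketch below is purely illustrative: the dictionary keys, the license list, and the format list are my own choices for the example, not part of the Open Definition itself.

```python
# Open licenses commonly applied to data (illustrative subset only).
OPEN_LICENSES = {"CC0", "CC-BY", "CC-BY-SA"}

def openness_report(product):
    """Check a described data product against the three principles:
    open license, no-cost accessibility, and an open machine-readable format.

    `product` is a plain dict whose keys are invented for this sketch.
    """
    return {
        "open_license": product.get("license") in OPEN_LICENSES,
        "accessible": product.get("download_url") is not None
                      and product.get("cost", 0) == 0,
        "open_format": product.get("format", "").lower() in
                       {"csv", "json", "netcdf", "xml"},
    }

# A hypothetical data product description; the URL is a placeholder.
report = openness_report({
    "license": "CC-BY",
    "download_url": "https://example.org/data.csv",
    "cost": 0,
    "format": "csv",
})
```

A product passing all three checks would satisfy the definition of an open work in the sense described above.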
So let's step back and think through how data sharing has evolved in ecology. I'm going to use the United States as an example here, partly because I'm more familiar with it, and partly because there have been some advances there that have been adopted internationally, which I'll touch upon briefly. First of all, if we go back to the International Biological Program (IBP), this was the first decade-long, large-scale international program focused on ecosystem science. It was carried out in a number of biomes around the world, forested and grassland areas internationally. As you can see from the example in the lower right, the government of Canada created a stamp recognizing the importance of the IBP. It was implemented in many countries around the world at different paces, but all during that one decade. Dave Coleman wrote a book called Big Ecology that focuses on the IBP and many of the programs that subsequently evolved from it.

What was interesting about the IBP was that it was geared from its inception toward facilitating modeling and synthesis of data across all of these different biomes internationally. Given that goal, it's notable that John Porter and Tom Callahan, in a 1994 analysis of the program, reported that data policies and protocols were never elaborated, nor even agreed to in principle, under the IBP. There were some major successes coming out of the IBP, largely due to a number of smaller working groups and synthesis efforts to which individuals contributed their data, but that was done in more of an ad hoc fashion. And I think, arguably, most of the data collected under the IBP are no longer readily accessible for use by scientists. So that was a first stepping stone into this whole concept of open science and how to support that type of effort, and it was not necessarily a big success.
If we fast forward a couple of decades in the U.S., from the mid-60s and 70s up to the 80s, we had the inception of the Long-Term Ecological Research (LTER) Program in the United States. This started in 1980 with, I think, six initial sites that were funded; there are now roughly two dozen sites in the LTER Program in the U.S. For the first decade there were no real specific guidelines for how data were managed within and across sites, and that created some real challenges that were recognized both by the National Science Foundation, which funded the program, and by many of the researchers involved in the projects themselves. This led, in 1990, to the LTER guidelines for site data management policies. The challenge was that while the guidelines laid down some recommendations, every site was given lots and lots of leeway in how it implemented them, so there was again a lack of consistency in how data were managed and shared across the network. By 2005 this was recognized as a problem, and all of the site principal investigators got together and came up with a much stronger policy requiring that data requirements be standardized across the entire network; that was approved by the LTER Coordinating Committee in 2005. As you can see in the figure caption, since then about 20,000 data packages created by the LTER program have become readily available and freely downloadable through the LTER data web portal. That has been a huge success, and it has led to lots and lots of subsequent synthesis efforts. There have also been some external factors that have influenced data sharing and data management policies in the U.S.
The first one goes back to the National Science Foundation, which in 2001 released its policies for data sharing, with the expectation that investigators would share data and other results of the scientific process within a reasonable time and at incremental cost. Under the Bush administration, in 2007, we had the America COMPETES Act, which required that procedures be put in place to facilitate data exchange across all the different federal agencies in the U.S. That was strengthened this year, in fact about a month ago, with the NSF Public Access Plan, which describes the implementation schedule for sharing both publications and data in public repositories; these are all expected to go into full effect by 2016-2017. In addition, under the Obama administration we've had, over the last few years, a big focus on what's called the Open Government Initiative, and this too was updated just a week ago with formal guidelines requiring different levels of access and openness to both data and publications.

There have been two major studies that looked at data sharing across the scientific community: one from Wiley Publishers that was released fairly recently, and one completed about four years ago by one of my colleagues in the DataONE project, Carol Tenopir, and several of her colleagues, which focused on the environmental sciences community in particular. I'm going to share some results from both of those studies. First, with respect to the Wiley study, I think one of the really interesting results was that, if we look internationally at data sharing, we've passed a sort of tipping point: most scientists now agree with the statement that they are quite happy and interested in sharing their data.
Ten years ago this would not have been the case, and I think five years from now we'll probably see an even higher percentage agreeing with that sentiment. There are, of course, some differences across countries and across disciplines as well. Some of the reasons that researchers are hesitant to share their data are highlighted on the right side of the chart here, but I'm going to go into a little more detail in this slide and group some of those challenges together, recognizing that there are really four major impediments to data sharing. One is that researchers want to make sure they receive proper credit and attribution for the data products they create. The second, and I think a challenge that is by and large still with us, is the fact that many of the tools investigators have access to for managing data, such as metadata management tools, have not been particularly user friendly or necessarily readily available; I will highlight one particular exception to that later in my talk. Education has been another key area where I think most researchers would argue they need better support on several fronts. One is best practices for managing data. Secondly, and I think this is universal, probably all of us on this webinar would agree that it's very difficult to fully understand legislative responsibilities and other issues associated with intellectual property rights, confidentiality, and ethical aspects. The legal jargon can be quite convoluted, and there are real challenges when we cross international boundaries: what may be legal or appropriate in one country may not necessarily be legal or appropriate in an adjacent country, so we definitely need much better education with respect to that. The third sub-example under education is perception: clearly, in the past, a lot of scientists have felt that if they share their data they're likely to be scooped, or to have their data misinterpreted or misused.
One that really got me was that about 10 or 15 percent of the Wiley respondents said they felt their data were not relevant. If I were seeking additional research funds from a sponsor, I would probably not admit that my data are irrelevant, but in any case I think education has gone a long way toward flipping the tipping point on these perceptions. Lastly, incentives and disincentives: to encourage data sharing, there's a clear recognition that for things like the tenure and promotion process we need to make sure the incentives are in fact there to support researchers in sharing their data.

I wanted to highlight a couple of figures here just to emphasize some of my prior comments and amplify them a bit. If we look at the upper left panel of the quad chart, we see something referred to as the long tail of orphan data. One of my colleagues, Bryan Heidorn, proposed this several years ago, and I think it really makes sense: when most investigators try to deposit or manage their data, they recognize that there are some big, well-known repositories out there. Probably the more commonly known ones are GenBank and the Protein Data Bank, for sequence data and protein structure data in particular; communities have rallied around those, and it's now the status quo to deposit your data in those repositories. Many researchers, though, have not had access to similar repositories, although that is changing, and those who haven't have in many cases archived, or attempted to preserve, their data on their own laptops or desktop machines, or possibly at a university or some other location.
In many cases those sources are not secure for the long term, and we end up with data being orphaned and ultimately lost over time. This is amplified by the figure in the upper right, from a paper published in 2013 by Tim Vines, again one of my colleagues. It's a really great story about how data undergo entropy, how they're lost over time. He and his colleagues surveyed a large number of authors of journal articles to determine whether or not the underlying data were still available, and they found a fairly rapid drop-off: over a 20-year period a large percentage of the data became totally unavailable, with a fairly steep gradient in the loss of information over time.

Another point that really amplifies the need for education is illustrated in the lower right bar graph, from my colleague Carol Tenopir's study published in PLOS ONE about four years ago. She was able to document that most researchers did not use a metadata standard when creating their metadata. The second most common response was that researchers used a metadata standard they had created in their own laboratory, which is arguably not a community-wide standard. Only a much smaller fraction used one of the main community-wide standards, like ISO 19115 for geospatial data, EML (the Ecological Metadata Language), and several others. And lastly, the picture of the smokestacks from London's industrial period highlights the fact that most scientists interested in discovering data have very little idea where to go. There are many, many repositories out there, and a lot of them are small, as indicated by the tiny smokestacks.
A few of them are quite large, like the GenBank smokestack and a couple of others, but it really is difficult to find data that have been archived or preserved in the many smaller repositories.

In the next few minutes I'm going to talk about some of the solutions to these challenges. First, with respect to credit and attribution, many scientific journals, especially the big-name ones like PLOS, Nature, Science, Ecological Monographs, and others, now require authors to share the data that underlie articles in those journals. Also importantly, quite a number of new journals are emerging that are called data journals; examples include the Geoscience Data Journal, GigaScience for extraordinarily large data sets, Nature Publishing Group's Scientific Data, and one that I've been involved with for the last decade or thereabouts, Ecological Archives, for publishing data papers in ecology.

In addition to data journals, there are some important data repository solutions out there, one of which many of you have probably heard of: the Dryad Digital Repository. This is geared toward publishing data that underlie scientific publications. There are roughly 75 or so major journals that are now members of the Dryad consortium, plus roughly another 100 journals, I think, whose authors have had their data published in Dryad. This provides a mechanism for providing long-term access to the data and linking it to the publication, and I'll show you how that works in the next couple of slides.
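A key piece of that linkage is the recommended citation that a repository like Dryad attaches to each data package. As a purely illustrative sketch (the helper, the author names, and the DOI below are my own placeholders, not part of any Dryad API or a real dataset), a DOI-based data citation is typically assembled from a handful of parts:

```python
def format_data_citation(authors, year, title, repository, doi):
    """Assemble a repository-style data-package citation from its parts.

    Real repositories each publish their own recommended citation format;
    this just shows the common pattern of author/year/title/DOI.
    """
    author_str = ", ".join(authors)
    return (f"{author_str} ({year}) Data from: {title}. "
            f"{repository}. https://doi.org/{doi}")

# All values below are placeholders for illustration.
citation = format_data_citation(
    ["A. Author", "B. Collaborator"],
    2013,
    "Growth data underlying the analysis",
    "Dryad Digital Repository",
    "10.5061/dryad.example",
)
```

The resulting string is what would appear in a paper's reference list alongside the citation of the article itself, which is how data producers accumulate citable credit for both products.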
In Dryad, when an author submits a manuscript to a journal, they are requested by that journal to also submit the data to Dryad, and Dryad then makes the data package available to the reviewers of the manuscript. So the reviewers can look not only at the manuscript, to see whether the findings are well described, but also at the underlying data via Dryad. If the paper is accepted for publication, the data are made available at the same time, and importantly there's a recommended citation for both the data and the paper.

Here's what that looks like. In the upper left gray box you can see a journal paper in Systematic Biology, and below it the data and metadata files in a package that you can easily access. Importantly, Dryad, the repository, links back to the journal as well, so that someone looking at the data can go back and read the journal article, see where the data came from, and get more information that way. And again, Dryad provides a recommended citation for both the paper and the Dryad data package, so authors essentially get credit for both the papers they produce and the underlying data products. This is what it looks like in the literature: here's a paper by Joseph Mascaro and colleagues, citing a data product in the Dryad Digital Repository by Xan et al. that's accessible through its digital object identifier, using the Dryad citation.

In the next little bit I'm going to cover some of the tools that I think are instrumental in helping promote open science more broadly, covering a few elements of the data lifecycle illustrated here, going from data management planning through collection, assurance, preservation, analysis, and so on. With respect to planning, there's one tool in particular that has proven extremely valuable in both the US and the United Kingdom, and
that is the DMP Tool, a web-accessible service available in both countries. This is what the front-end web page for the DMP Tool looks like. I've signed in as myself through the University of New Mexico here, but you don't need to belong to any particular university in order to use it. What it does is step you through the process of creating a data management plan, which is now required in the US and UK by many funding agencies, including private foundations such as the Wellcome Trust and the Gordon and Betty Moore Foundation. In this case I've shown the National Science Foundation's requirements for its Biological Sciences Directorate. You can see in the lower right there's an open blank space where an individual would write their response to a set of questions about data collection, formats, and standards, and above that the University of New Mexico provides some guidance with respect to the answer for this particular template. So the DMP Tool steps you through all the basic requirements for a good data management plan that would satisfy a large number of different funding agencies. The plan can then be published, and you can in fact share the data management plan with your colleagues and others as well.

Some other tools are very important too. Morpho supports metadata creation and management; it's a package that can be downloaded and used by anyone, and it's great for dealing with ecological, environmental, and many other types of observational data. Here's what it looks like; this is just one example of the resource screen. You can type in the name of the submitter and the creator of the data set, plus other information such as an abstract, keywords (using a thesaurus such as NASA's Global Change Master Directory), the temporal coverage for the data set, spatial coverage, and so on, and then you can upload the data file as well.
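The fields that Morpho steps you through (creator, abstract, keywords, coverage) map onto a structured metadata record. Here is a minimal sketch using Python's standard library; the element names loosely follow the Ecological Metadata Language, but this is a simplified illustration, not the schema-valid EML that Morpho itself writes.

```python
import xml.etree.ElementTree as ET

def build_metadata(title, creator, abstract, keywords, begin, end):
    """Build a minimal EML-flavored metadata fragment.

    A real EML document has a much richer, schema-validated structure;
    the nesting here is simplified for illustration.
    """
    ds = ET.Element("dataset")
    ET.SubElement(ds, "title").text = title
    ET.SubElement(ds, "creator").text = creator
    ET.SubElement(ds, "abstract").text = abstract
    kw_set = ET.SubElement(ds, "keywordSet")
    for kw in keywords:
        ET.SubElement(kw_set, "keyword").text = kw
    coverage = ET.SubElement(ds, "temporalCoverage")
    ET.SubElement(coverage, "beginDate").text = begin
    ET.SubElement(coverage, "endDate").text = end
    return ds

# A hypothetical dataset description, invented for the example.
record = build_metadata(
    "Soil moisture at an example grassland site",
    "A. Researcher",
    "Daily soil moisture readings, 2010-2014.",
    ["soil moisture", "grassland"],
    "2010-01-01", "2014-12-31",
)
xml_text = ET.tostring(record, encoding="unicode")
```

The point is simply that once these fields are captured as structure rather than free text, the record can be indexed, searched, and validated by machines, which is exactly what makes downstream discovery services possible.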
In addition, Morpho provides access to a number of other screens where you can go into much more detail about the data as part of the metadata, and the record can easily be updated or revised over time, so it's a very useful package in that respect.

Under the preservation umbrella I wanted to highlight one thing in particular to start off with: re3data.org. Again, many researchers are not sure what public repositories exist worldwide, and this is a great resource for that. It's constantly growing; this screenshot is a few weeks to a month old, and at that time there were about a thousand reviewed repositories. I'm sure it's much higher now. You can do a quick search here under a variety of keywords, and it will point out the repositories that meet those particular needs, or you can just browse through the entire catalog if you so desire. It's a great way to discover what data repositories exist for different scientific domains and fields.

With respect to discovery, this is where I want to point out a couple of things. There are clearly a number of approaches one can follow in discovering data, using things like Google or Bing or other search engines, but quite often they don't lead to the types of data you're looking for. So I want to introduce DataONE. This is a project I'm associated with in the US; it's an international program to federate across data repositories, and there are three components to the DataONE infrastructure. The first is what we call coordinating nodes, which provide a lot of the broad services: replication and other network-wide services, plus the indexing and search tools. Essentially, metadata from all the associated data repositories that are part of the federation are searched as part of DataONE services. What we call member nodes are all of those different data repositories
worldwide. We now have a couple from Australia that will soon be made available through the DataONE infrastructure as well, including the AEKOS web portal. These are, again, all the different repositories worldwide that actually host the data but have shared their metadata with the DataONE cataloging and indexing service. The third component is what we refer to as the investigator toolkit: a variety of different tools for which, in most cases, we've provided a direct linkage to DataONE data resources. A tool like ONE R, for example, connects DataONE with the R statistical analysis program, allowing researchers to easily access data held in any of the DataONE-affiliated repositories, do their analyses, and possibly generate new data that might then be uploaded to one of the affiliated repositories.

This is what the website looks like; feel free to check it out at dataone.org. Basically, you might click the search button at the very top, and that leads to a more advanced search tool. Here I've just typed in the word "tree"; I could have specified a narrow range of dates, specific countries, a bounding box, or a state within the US, for example, and it would have narrowed down the search based on those criteria. From this broad query I got back this next screen, which listed a number of data sets, including one at the very bottom on growth and mortality of tropical tree species in India, and lots of others. Using the faceted search tool above, I could do some additional constraining of the responses by focusing on data from a particular repository or by a particular author, or even add some additional keywords. So it's a very effective way to identify and get access to scientific data. In this case we're looking at the metadata for the data set I mentioned previously on tropical tree species in India, and if we scroll down and look at
the entire metadata record here, we may decide we want to download the data. We can do so by clicking the download button, or download both the data and the metadata associated with the data files. This is what that looks like: these are the various data files and metadata records associated with that particular data set. Again, we can download the entire package and have ready access to it on our laptop or desktop machine. With respect to DataONE, we now have about 30 large national and international repositories participating, and we're now approaching roughly half a million data products available through the DataONE federation.

Another area that is really helping facilitate open science is analysis and visualization, and I wanted to provide a couple of examples here. There are a number of tools, like Kepler, Taverna, and VisTrails, that make it possible to create scientific workflows that string together lots of complex analyses. We can then share a workflow with others, who can repeat the same set of analyses we've done, or modify the workflow to their particular needs, and then possibly upload it again to another site where it can be downloaded and reused. In case you're interested in some advanced visualization tools, this one is called VisTrails; it's an open package that you can easily download and use to create some quite sophisticated visualizations, as depicted on the right side of the panel here. In addition, VisTrails provides some really nice add-on services that collect provenance data on how the data products, in this case the graphics, were generated, so it's easy to look back and see the sources of the data that went into them. And then there's a nice tool created by Carole Goble and David De Roure in the United Kingdom; this is called myExperiment, and it's a great way
to upload your workflows from a whole variety of packages, including Taverna, Kepler, VisTrails, and others. In this case we see a workflow created by Paul Fisher. There's an abstract about the workflow, there are community ratings (it's rated 4.6 out of 5), you can see how many times the workflow has been viewed, and it's also been downloaded 1,600 times. If you care to, you can simply click the green arrow on the right side and actually download that workflow, attempt to rerun it, or again modify it and upload it back to myExperiment for others to use.

Training has been really key as we've been moving into this open science framework. We've done a lot of training through DataONE, and there are lots of other groups that have been involved in supporting training as well. In addition to hands-on training, which we do at various professional society meetings in the U.S.
and elsewhere, one of the things we've done in DataONE is create what we call the best practices database. You can click on one of the elements of the data lifecycle that shows up in orange at the bottom of the screen; if you were interested in QA/QC mechanisms, for example, you could click on "Assure" and it would bring up some best practices with respect to that. You can also search by entering your own keywords for various best practices. And then, importantly, in the center there's the best practices primer. We created this in response to the community's request for a very simple data management guide that could be easily read and digested, so that people could immediately start to manage their data better. The primer on data management is nine pages long, very short; you can download it, share it, give it to your students in your classrooms, and so on. It goes through all the best practices with respect to the data lifecycle, and there are pointers and links in the document to additional tutorials and other information about managing data.

I want to conclude with a couple of things. First are some basic best practices for data sharing and contributing to open science. The first one: I think it's very important to create a data management plan, and if you have access to a tool that allows you to publish that data management plan, do so. This really helps you formulate good, solid practices for managing the data before a project gets underway; you're basically stating how you're going to manage data during the project and after the project is completed. The second is to use tools like Morpho to document your data to the maximum extent possible. This means creating descriptions of the data so that someone who is not familiar with them can understand, interpret, and use the data correctly. This requires, again,
lots of details about the methods that were employed, where the data are located, formats, and lots of other information, and Morpho is a great tool for stepping you through those requirements for developing a good, solid description of your data. The next step is to preserve the data, ideally in a community repository; if you follow all those steps, you've created a data product that should be ready and sufficient for data sharing, discovery, and reuse by others. The third recommendation is to publish your data and metadata, either in a data journal or in something like Dryad, an open digital repository, so that you and others can easily go back to the data associated with publications and possibly reuse them and continue to build science based on that earlier data set. Fourth, in addition to the data, it's quite important that you and others be able to understand the methods that went into creating the data set and possibly analyzing and interpreting it. This is the aim of the fourth recommendation, which is to publish your analytical workflows and software. Workflows can easily be published in myExperiment, and many scientists now use additional repositories like GitHub to archive their software code for the long term and share it with the community as well. And lastly, I would argue it's important to publish results in open journals; there are lots of them out there now, such as PLOS ONE and Ecosphere, that provide free access to the publications that scientists create.

So where are we heading in the future? I want to highlight just a couple of future directions in the open science movement. The first slide really encapsulates a lot of information, so I'll step through it from the top down. At the very top we see the generation of ideas, which is sort of the first step in the scientific process, and a lot of scientists nowadays are getting ideas from places like science blogs, Twitter feeds, and others. And then on the right side it shows you
know, as you're generating ideas, that we want to share those; you can in fact do so via something like an open laboratory notebook, of which there are many available, including many free and open-source solutions. You might do this as you're developing a research project: create an open lab notebook and share that information with the folks working in your laboratory, close collaborators, or others. The second step is planning your research and writing proposals. Again, you can get lots of great ideas from literature that you might discover through Mendeley, ResearchGate, or other locations, and as you're developing your proposal you may do that through something you can open up to your colleagues, like Google Docs, which is how I've written, I think, the last five or six proposals I've undertaken. Then, going through the whole data lifecycle, there are a lot of tools you can take advantage of, and lots of places where you can deposit and share the products as well. On the right side, for example, we see the DMP Tool; GitHub for depositing code associated with, say, organizing, managing, or quality assuring the data; KNB as a repository for archiving the metadata and the data products; and workflows, which again can be archived in myExperiment, having been derived from, let's say, R, Kepler, or VisTrails on the left side. And then, when we go to disseminate results, there are lots and lots of places we can do that nowadays: we can share our posters via things like figshare, our PowerPoint presentations via SlideShare, code via GitHub, preprints via PeerJ, for example, among others, then publications via open access mechanisms like PLOS, and then data and metadata through a variety of different public repositories.

Now, in DataONE we're doing a couple of things to help promote open science in the future. One is a provenance tracking system we're working on now, and the way this works is that we're looking at historical
CTD data from an oceanographic cruise. These were data that were collected in 2014. We can actually look at and download the data via DataONE, and on the left side it shows the sources that might have gone into creating that particular data set. This is a product that will be released in another year or so, so it's not finalized yet, and this is part of the usability testing we're doing, but this is more or less what it might look like. So we see the sources for that particular data set on the left side, and then that data set, historical CTD data from the Gulf of Alaska, may be used by two publications subsequently, and those will be highlighted and clickable on the right side. So this is, again, one example of being able to promote that reproducibility of science and being able to document where data sets were derived from and how they were subsequently used. And then lastly, I think another key activity that we're involved in is creating a semantic annotation tool, and this is for data originators, as well as others that may have used a particular data set and want to come back and add in some notes to it. So in this case, we see an example where several people are adding in different comments about the data set, and this may help amplify some of the methods that were used by the data creator. Or there may be a couple of red flags and questions that were identified by users, which could then be responded to by one of the data originators, for example. So this is the equivalent of adding Post-it notes to data products so they can continue to be used and gain value over time. The last slide I had here was on this whole topic of altmetrics. Again, I think this is really helping lead that trend towards open science, in that there are mechanisms now like ImpactStory. This is a creation of Heather Piwowar and some of her colleagues. It's an enterprise that tracks the contributions of researchers in a whole variety of different areas, and it will highlight,
for example, the number of papers you've had, the number of downloads of those papers, the number of citations of those papers, the number of tweets that you've had on your Twitter account, and so on, plus lots of other ways of documenting scientific productivity via this altmetric approach. So again, this is something where I think we'll see more and more of a focus on altmetrics in the future. And I'll conclude with just this: again, this is our website, dataone.org. These represent a lot of the communities that we've tried to work with over the last several years of creating DataONE. In the top left, there's a senior scientist associated with a global change research program. In the bottom left is a librarian associated with the University of California who is interested in providing education resources for faculty members associated with the library. And then on the right side is a young investigator associated with the Lake Baikal research program who's interested in reproducibility as well as in providing tools to her colleagues on the project. Thank you for your attention.
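The provenance tracking idea described above, a data set linked backward to the sources it was derived from and forward to the publications that used it, can be sketched in a few lines of code. This is a minimal illustration only, not DataONE's actual data model or API; all class names, field names, and identifiers here are invented for the example.

```python
# Minimal sketch of a provenance record of the kind described in the talk:
# a data set with upstream sources and downstream publications.
# All names and identifiers are hypothetical, not DataONE's real API.
from dataclasses import dataclass, field


@dataclass
class ProvenanceRecord:
    """Links one data set to what it was derived from and what later used it."""
    dataset_id: str
    derived_from: list = field(default_factory=list)  # upstream sources (left side)
    used_by: list = field(default_factory=list)       # downstream publications (right side)

    def add_source(self, source_id: str) -> None:
        if source_id not in self.derived_from:
            self.derived_from.append(source_id)

    def add_usage(self, publication_id: str) -> None:
        if publication_id not in self.used_by:
            self.used_by.append(publication_id)


# Example: a historical CTD data set with two downstream publications
# (the source and publication identifiers are invented).
record = ProvenanceRecord(dataset_id="ctd_gulf_of_alaska_2014")
record.add_source("raw_ctd_casts")
record.add_source("cruise_station_log")
record.add_usage("publication-one")
record.add_usage("publication-two")

print(record.derived_from)
print(record.used_by)
```

Keeping both directions in one record is what makes the interface described above possible: clicking a data set can show its sources on the left and the publications that reused it on the right.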