It's time to begin. Welcome back to the closing plenary session for our fall 2011 meeting. I trust that you have had a very fruitful day and a half since we gathered yesterday. Certainly I've had an opportunity to check out some wonderful breakout sessions, and I've heard a lot of very positive comments about breakout sessions that others have attended and conducted as well. I've also been hearing, at least informally from a few of you, considerable support for the restructured schedule with the extra round of parallel sessions, and I'm glad for that feedback. I would invite those of you who may have views on that that you haven't had a chance to share to keep them in mind when you get our email with the meeting evaluation in a week or ten days. Certainly it will be helpful to us to have that feedback in structuring future meetings. I don't have a lot to do before introducing our closing plenary speaker, but there are a couple of things I do want to do. First, I want to remind you of the list in your white registration packet of upcoming meetings, in particular our meeting in April in Baltimore. I hope that we'll see many of you there. Next December, we will actually be back in the District for our December 2012 meeting. Gosh, that seems so far away and so soon, too. Anyway, we have the dates for all of those set. You can find them either in the packet or on our website if you want to mark your calendars. I'd like to ask for a round of applause for the large number of people who came here and contributed their time and their insight in all of the breakout sessions. That really is the heart of this meeting, and it is greatly appreciated. So let's thank them all. I'd also just like to take a minute to thank all of the folks who make these meetings run as smoothly as they do: Jackie, Sharon, Angelo, and Diane. They have been, as always, a wonderful help in making everything go smoothly. Thank you.
And with that, let me move on to the main reason we're here. You're going to hear this afternoon from Professor Bill Michener. Bill is at the University of New Mexico. He is a scientist who's done a tremendous amount of what I would characterize as genuinely multidisciplinary work around the life sciences, the earth sciences, and informatics: exactly the kind of creative synthesis and heavily team-based multidisciplinary work that I think so well characterizes much of the 21st century scientific enterprise. He is also the director of the NSF DataONE project, which is their flagship DataNet project. DataONE is a very, very interesting project, which I've gotten to know a fair amount about as I've had the privilege of serving on their external advisory committee for the last 18 months or two years, whatever it's been. And that's been a really wonderful experience, because this is a project with a lot of moving parts and a lot of connections to other things. It's a very strategic kind of multi-piece project that's intended, I think, to complement and connect up with a whole range of other important initiatives going forward and to make the whole much more valuable than any of the individual parts. It also, I think, recognizes that this is not just about technology; it's really about culture and sociology and changing consensus about practice and about relationships and interconnections, and not just in a sort of abstract, studied way but in a real build-new-institutions-and-new-bridges kind of way. And I think you'll see that ideas, for example, about citizen science and citizen engagement run very deeply through some of the underlying thinking here. Bill has been just a wonderful leader for this project, shaping the vision, coordinating a small army of participating institutions, and leading them forward. Today he's going to, I think, do two things.
He's going to talk about the way science and higher education and research institutions are changing in the face of all of these developments. And then he's going to use that as a context to help us understand some of what DataONE is hoping to achieve. So I think you'll find this to be a very thought-provoking and informative talk. And I'm going to turn the podium over to Bill. If you want biography, there's plenty of it on the net or in the book. Welcome, Bill. Well, thank you all very much. All I have to say is, wow, this has been a great meeting. This is one of the, I think maybe the only, meetings I've ever been to where I've always wanted to be in two or three places simultaneously. And that's quite difficult to do. What I wanted to talk about were several things. One, some of the new paradigms that I think are shaping science and academia. And these are accompanied by some fairly significant challenges that we all face in our daily dealings. Then I want to introduce DataONE for maybe 20 or 30 minutes and talk about it and our approach for trying to deal with some of these challenges that I think we're facing. And then I'll conclude with a few remarks about the future. So first of all, some of the new paradigms. I think we're entering, overall, an era of grand challenges. And this can be interpreted as grand challenge science, grand challenge humanities, grand challenge scholarship. You name it, I think it's a time of grand challenges for all of us. This particular graphic here indicates some of the real driving challenges in the environmental sciences that I think are facing us right now: things like global climate change, population redistribution and impacts on water resources, and a whole array of things that have not only scientific implications, but some fairly significant societal implications as well.
And hence, one of the reasons I'll refer to them as grand challenge questions that I think, again, will be dominating science for probably the next several decades. This is indicated even by looking at the budgets of various funding agencies. Here I show the 2012 budget for the National Science Foundation. And you'll notice that the major new investments there focus on three areas of what I would term grand challenge issues. One is clean energy: developing those new sources of energy that decrease our dependence on foreign oil and so on, as well as have less of an impact on the environment. Secondly, something called SEES, which stands for Science, Engineering, and Education for Sustainability. And this is roughly a billion dollar effort out of the next fiscal year's budget that will focus on this particular activity. And this crosses virtually every directorate within NSF. And then finally, cyberinfrastructure for the 21st century, which will be essentially the information technology platform that will support those other two initiatives plus many other initiatives within the foundation. I think a second paradigm that follows from this is that data intensive science is going to reign. And I would be surprised if there's a single person in the audience who hasn't seen this book cover from The Fourth Paradigm. But if you haven't, look at it; I think it's really a game changer in the literature and in the way that we will be doing our science for the next few years. In particular, I think it leads to the fact that we're going to be searching for new tools and new ways of working as we try to deal with these massive data streams that will be heading our way. And that leads to sensors.
One of the things that I think is going to be driving this data intensive realm is that it was projected that by the end of 2011 we would have over one billion sensors in the environment collecting data (probably we've already surpassed this; I've not checked the most recent numbers). And these would be collecting stream, atmospheric, and other environmental science type data, all of which can help address those grand challenge questions that I showed earlier. I also include in the lower right a picture of a human here, because human observers also represent key sensors in the field. And they are the real genesis of what is now referred to as citizen science, or public participation in scientific research. They also come coupled with their own sensors. I think currently there are four billion cell phones in the world, many of which have GPS capabilities and other sensors associated with them as well. In addition, I think we are truly entering into an era where average citizens are going to become more and more involved in science. One of the groups that I've been involved with now for several years is in the upper left; it's called eBird. That's a group of roughly 30,000 active participants that contribute bird checklists on at least a weekly basis. The actual number of individuals is much larger than that, but this is the average number of really active participants. And this is growing on a daily basis, because eBird just within the last six months, I believe, went international. So the majority of the countries in the world now have an eBird citizen science program. There are literally hundreds of citizen science programs out there. I did just a quick search on the web and discovered these various logos here, but there's probably an order of magnitude more citizen science programs that I'm not showing on this particular slide.
One thing, if you are interested in this: we are holding our first citizen science workshop, or symposium, this August in conjunction with the Ecological Society of America. And we anticipate probably three to four hundred participants representing, again, this whole array of citizen science programs. I think another element of this particular paradigm is the fact that teams of scientists are now having to work together to address these complex challenges. And the reason for this is that we're dealing now with problems that require us to work across multiple temporal, spatial, and thematic scales. And this is hard science. Most of us, as we were going through school, focused on a fairly narrow subset of a scientific problem. We didn't have to work across disciplines. We didn't have to worry about working across different scales of space and time. And this, again, presents, I think, some real challenges as we try to do so. Increasingly, there is instrumentation and support for all of these different scales that I show on the bottom. Clearly, we have a lot of data coming in through MODIS and other remotely sensed platforms. And then more recently, if you look at the top of the pyramid there, we have an increasing number of programs supported by the National Science Foundation in the U.S., as well as similar programs and platforms internationally, that are collecting very intensive data at a fairly small number of sites within a particular biome or region. The third paradigm shift, which I think you are aware of as well, is the fact that data, for the first time over the last year or so, are being recognized as valuable products of the scientific enterprise. And this is something that I've actually always found a bit troubling, since almost my entire career has been based on soft money support through the National Science Foundation.
And I know when you get to the end of a project with NSF, what you submit is a project report that basically outlines the papers that you published and the graduate students that you supported. And there's little recognition of other contributions that you may have made, such as data you created, software you produced, and so on. And that's all clearly changed in the last year through NIH, NSF, IMLS, and other agencies and foundations that are now insisting that at least data management plans be presented as part of the proposal package. Dryad is one of our member institutions in the DataONE project. In this case, they recognize data as valuable contributions by providing digital object identifiers, which show up in that yellow circle there. And they also provide, in the lower boxes there, recommendations for not only citing the publication, but also for citing the data as well, so that contributors get credit not only for the work that they did in terms of producing a paper or a book or whatever, but also for the data that they curated and managed and published. Fourth, libraries I think are going digital. I think it's a few years down the road before we lose most of the paper in libraries, but I foresee that coming. And along with this, there's, I think, a big change in terms of how libraries are working. This is just one example of a project we're working on at the University of New Mexico that was recently funded by Tony Hillerman's daughter, as a matter of fact. And this is bringing together all of his interviews, manuscripts, drafts, every piece of information about Tony Hillerman available, so we can serve a particular scholarly community that is interested in his works and going back through and looking for connections and so on. And this is just one example, I think, of the types of activities or collections that libraries will be developing over the next few years.
In addition, libraries I think are going to be changing the way they look and feel and act. We're going to see a lot more advanced infrastructure in libraries: things like visualization walls, collaboration spaces, and so on, in areas where people can get to work and deal directly with data and information to help create knowledge within that particular environment. And we're also clearly going to see much more in the way of collaboration spaces. I know that at our university we're continually shrinking down the volume of books that we have, via movable stacks and other mechanisms, to create more space for building collaboration spaces and adding in technology areas. And then fifth, I think one other big change is that this ability to deal with data is going to become the new statistics. And we need to recognize that, and I'll talk more about this in just a few minutes. So those are the five paradigms I think that are accompanying this new era of grand challenge science or grand challenge scholarship. But there are also some significant challenges that we face. This article here comes from William Brinkman, who is head of the DOE's Office of Science. And he argues, I think very vehemently, that we need to really push for federal support for basic research. And there are several reasons for this. One is that commerce and industry have largely stepped back from a lot of the basic science support that they provided in the past. Similarly, other institutions don't have the dollars and wherewithal to support basic scientific research. And he argues in his paper, and I think we've all seen this, that support for science historically has been a bipartisan issue. And there has unfortunately been some partisanship entering into support for science and STEM education. And I think all of us really need to fight that as hard as we can.
Another challenge, which builds upon one of the previous slides I showed, is that big science again requires working across these different scales. But in our studies in DataONE, and in our interviews with scientists and communities, it's evident that currently most scientists are spending about 80 percent of their time dealing with fairly mundane data management aspects. And I include in that things like merging different data sources together, doing a lot of the manual documentation, a lot of the more trivial aspects of science: trivial from their perspective, in the sense that they would prefer to spend their time doing the analysis and interpretation of the data that they are collecting. So, you know, there's a real balancing act that we have to play here. And again, scientists feel like a lot of their time is spent doing things that they would really prefer to have automated, or to have other alternatives for. This is from a paper I published back in 1997 that led to Ecological Metadata Language, a metadata standard, being created, as well as the Biological Data Profile at the USGS. But it points to something which I think is still true, and that is that we have a constant problem associated with data entropy. Most of the data products that we produce in science continually lose their value over time, because we lose the understanding of what those data mean. We lose the context for why the data were collected. We lose information such as where the permanent plots were that we collected data from originally, and a lot of the details associated with collection and analysis and so on. And over time, you lose enough of this information that it renders that data set meaningless for reproducibility or usability, let's say, in a meta-analysis or a synthesis effort.
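The data entropy idea above can be sketched in a few lines of code: the more contextual fields a data set's metadata preserves (who collected it, where, when, how, in what units), the better it resists losing its meaning over time. This is purely an illustrative sketch; the field names and the completeness score are invented for illustration, not the actual EML schema.

```python
# Illustrative sketch of "data entropy": a data set stays usable only as long
# as its contextual metadata survives. Field names here are hypothetical.

REQUIRED_FIELDS = {"title", "creator", "date_collected", "location", "methods", "units"}

def metadata_completeness(record: dict) -> float:
    """Fraction of required contextual fields present and non-empty:
    a crude proxy for how well a data set resists data entropy."""
    present = REQUIRED_FIELDS & {k for k, v in record.items() if v}
    return len(present) / len(REQUIRED_FIELDS)

record = {
    "title": "Permanent plot vegetation survey",
    "creator": "J. Smith",
    "date_collected": "1997-06-15",
    "location": "Plot 12, hypothetical field station",  # where the permanent plots were
    "methods": "1 m^2 quadrat, visual estimate",
    "units": "percent cover",
}

print(metadata_completeness(record))  # 1.0 -- full context retained
```

Losing even one of those fields (say, the plot locations) drops the score, which is exactly the slow decay the talk describes.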
So I would argue here that all of us need to really encourage and promote comprehensive metadata to the fullest extent possible so that, again, we can retain the meaning of the data products that we're producing. Another challenge that we're faced with was pointed out by Jim Gray and others. And it really highlights the fact that there are a small number of repositories that hold a tremendous volume of data. And this can include satellite data in the EROS data center, for example, or GenBank, Protein Data Bank, and others that are widely recognized in the community and, again, hold tremendously large volumes of data. But on the other hand, most data sets are actually small, they're distributed, and they represent the long tail of the distribution. And they are often referred to as orphan data sets, again, because there's no repository to essentially watch over them and preserve them for the long term. So this is clearly an issue that I think we all are aware of and need to deal with. And then what is probably one of the most serious challenges that we face is what I refer to as stovepiping of data. And this happens through the roughly five million plus data centers that have been enumerated worldwide. The large smokestacks you see here on the London skyline represent institutions like Protein Data Bank and GenBank that have large volumes of resources, and the smaller smokestacks represent those very small repositories that hold data, institutional or research project repositories, many of which are ephemeral in nature, and all of which are disconnected from one another. So if you wanted to go out and identify all the data you needed to answer one of those grand challenge questions I showed on the first couple of slides, you would literally almost have to go to each one of these individual repositories and try to discover those data. And that's clearly an impossible task.
So that leads me to DataONE and what we're trying to do about some of those challenges. And I'm going to introduce four aspects of our project: one, the fact that we engage the community, and have since day one; secondly, that we are really focused on building infrastructure, but only the infrastructure we feel is required, since we're not trying to reinvent the wheel; third, that we're doing a lot of work with respect to education and outreach to the community; and then finally, I'm going to touch upon our current status and the future of the project. So in terms of engaging the community, there are several ways we do this. One is illustrated by this young lady here, a scientist that we interviewed. She represents one of 10 different groups that we call our primary stakeholder communities. These include environmental scientists, citizen scientists, decision makers, and so on. And we've developed scenarios built around these different stakeholders, and this has informed the infrastructure that we include in DataONE. And it covers such activities as how that particular individual or stakeholder spends their day. How do they do their science? How do they collect their data? How do they interact with colleagues? All aspects of, you know, their daily professional life. We also do a lot of stakeholder surveys, and I'll show results of one of those in just a minute here. In the lower corner, we do a lot of usability testing of our website and the tools that we're producing. Then in the upper right, we have other working group activities, which I'll talk about separately in another slide in just a minute. So one of the first surveys we did was of the environmental science community. And this shows one of the key results, which I thought was surprising: more than 80% of the individuals surveyed said that they would be willing to share data across a broad group of researchers who use data in different ways.
And this is almost the reverse of what I would have expected. But I think the scientific community really is waking up and recognizing that we have these really challenging problems, and we do need to be more cognizant of our responsibility for sharing data and exchanging data with others. This is in a paper by Carol Tenopir, published in 2011 in PLoS ONE, if you're interested in the full set of analyses. As a corollary to this, though: again, 80% of the scientists surveyed are willing to share their data, but they face challenges in trying to do so. And this is just four of the problems that we identified. Only 35% of the academic scientists were satisfied with the established processes for storing data beyond the life of the project. Only 40% of academic scientists were satisfied with tools and technical support for data management during the project. Only 46% were satisfied with the tools for preparing their documentation. And then here's one that's really somewhat confusing: 62% of academic scientists said that they were satisfied with the process for cataloging and describing their data. But when you dig a little deeper into this, you discover that the majority of these scientists were using standards that they developed in their own laboratory. So these were not metadata standards that were broadly used in the community; these were just scratch pad approaches they used in their own laboratory to document data, and they were quite happy with those. It doesn't do much for data sharing, though. We have a survey for libraries and librarians out now, and we'll soon be surveying data managers, educators, and citizen scientists as well.
I mentioned that we also use working group activities, and this is one working group that we created that we did not anticipate when we put the proposal in. We call it exploration, visualization, and analysis. And this is bringing a group of really top-notch scientists together, putting them in the same room, and saying, okay, what's a grand challenge question that you all would like to tackle? In this case, we had a very diverse group initially, and we identified three or four major grand challenge questions that the group wanted to tackle, but identified one in particular that we thought we would start with. And this was trying to better understand the continental scale dynamics of bird migration. This involved pulling in data from 31 different data sources; I only show four here. eBird I mentioned previously, so citizen science data played a key role in this particular study. Interestingly enough, one of the data sets that we needed was actually in an individual scientist's laboratory in Utah, and we found out about it through word of mouth. So again, one of those challenges that we face when we're trying to answer these grand challenge type questions. In addition, we needed about 600,000 hours on the TeraGrid to process the data. We needed a new statistical model for doing this. And we needed advanced visualization tools as part of a workflow in order to support this. And what this is looking at is the Indigo Bunting. The white is the summer breeding territory. If you look over Atlanta, you might see a dark spot, which indicates the birds are actively avoiding metropolitan areas on their migration northward. We did this for about 250 birds, and this led to the State of the Birds report released by the president in May of 2011, and it's serving also as the basis for future research, as well as literally dozens of publications that will be appearing in the scientific literature.
One of the reasons we did this was to try to understand how scientists do their work and what challenges they encounter, and again, data discovery was a key one. The ability to process the data, to get the data to that 600,000-some hours of computing time on the TeraGrid, and then incorporating all of that into a visualization workflow package: these were all challenges for the community, and again they helped drive some of the architecture that we were incorporating into DataONE. We also have what we refer to as the DataONE users group, which provides an awful lot of feedback to us. They identify, for example, the tools that we incorporate into our toolkit, which I'll show in just a minute. They review essentially all aspects of the project and provide feedback and guidance to us. So that leads me to the cyberinfrastructure component of the project. Our goal here was to enable new science and knowledge creation through the ability to easily discover and access data, as well as tools that can support different aspects of the data life cycle, which I'll show in a future slide. And again, we sort of started with three precepts. One is that we wanted to build on existing infrastructure. We didn't want to reinvent repositories that already existed; we wanted to build upon those. Second, we wanted to build only the key glue infrastructure that was necessary to support that interoperability. And third, we wanted to support communities of practice and enable them to do new science. So our infrastructure is comprised of three components. The first and underlying layer there is what we refer to as member nodes. These are actual organizations: libraries, research networks, universities, federal agencies, and others. They all hold data. They typically serve a particular community, they probably have some kind of support services for that specific community as well, they retain their own copies of the data, and they're worldwide.
The numbers here are indicative of what we included in the proposal initially; our sites are actually much broader than just the small number of dots I'm showing, but this is just to give you an indication. The second key component is what we refer to as coordinating nodes. These are essentially the infrastructure we're building that hosts all of the metadata from all of those individual data repositories at three different coordinating nodes. So there's 24/7 replicated metadata, and it's indexed and easily searchable. The coordinating nodes also provide other network-wide services as well: some of the security aspects, the ability to replicate data across member nodes so that we have multiple copies of data stored in different locations, and so on and so forth. And then the third key component was one where we again received input from the community: they wanted to see tools that were tightly integrated with the DataONE resources. So you could use a tool that you're familiar with, let's say Microsoft Excel or R or some other package, and you could then essentially mount DataONE as a drive, easily discover the data that you want from DataONE, use those data in an analysis, and then potentially upload results to another repository. So these are the three major components, and this slide shows essentially a lot of the service interfaces that we're building for the member nodes and the coordinating nodes.
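A toy sketch may help make the member node / coordinating node split concrete: metadata harvested from every member repository is replicated to each coordinating node, so any one coordinating node can answer a search over the whole federation. All names, record structures, and the harvest loop here are invented for illustration; this is not the actual DataONE API.

```python
# Hypothetical sketch of DataONE-style metadata replication: member nodes hold
# the data; coordinating nodes each hold a full, searchable metadata copy.

member_nodes = {
    "KNB":   [{"id": "knb.1",   "keywords": ["vegetation", "plots"]}],
    "Dryad": [{"id": "dryad.7", "keywords": ["bird", "migration"]}],
}

# Three coordinating nodes, as described in the talk.
coordinating_nodes = {"CN-1": [], "CN-2": [], "CN-3": []}

# Harvest: every coordinating node receives the metadata from every member node.
for records in member_nodes.values():
    for cn in coordinating_nodes.values():
        cn.extend(records)

def search(keyword, cn="CN-1"):
    """Any single coordinating node can answer a query over the whole federation."""
    return [r["id"] for r in coordinating_nodes[cn] if keyword in r["keywords"]]

print(search("migration"))  # ['dryad.7'] -- found without knowing which repository holds it
```

The design point is that discovery no longer requires visiting each stovepiped repository in turn: one indexed, replicated metadata catalog stands in front of all of them.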
We also support client libraries for developers who want to add tools to our toolkit, and these include Java and Python, and we have a command line interface as well. One of the things we did was recognize early on the need to have a tiered implementation for member nodes, and we have four different tiers here. The first is read-only, public content; the second is read-only with access control; the third is read-write; and then finally the fourth and highest level tier is the ability to serve as a replication node for other member nodes' data. One of the things we do is publish all of our architecture documents and so on on this website; if you go there, you'll see roughly 500 pages of documentation. Everything we do is open source and openly available. This just gives you an idea of the diversity of member nodes that we're dealing with initially. The one on the left is the Oak Ridge National Laboratory Distributed Active Archive Center for Biogeochemical Dynamics. It holds some fairly large databases, and they have a very high level of curation there: these data are all heavily peer reviewed, and the data products are often used literally dozens of times for an array of different analysis problems. This is supported by NASA. I mentioned Dryad previously; Dryad allows you to publish data at the same time you publish your journal articles, and right now there are 30-some journals that are part of Dryad, including PLoS, which signed on just a few weeks ago. And then on the right is the one with the least amount of enforced curation, but it's quite a popular repository: it's the Knowledge Network for Biocomplexity. There are over 25,000 data products associated with it now, and a number of different metadata standards are supported in this particular repository. This has been supported by the National Science Foundation.
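The four member node tiers just described form a simple hierarchy, with each tier adding capabilities on top of the one below it. A minimal sketch of that idea (the operation names are illustrative, not the actual DataONE service interface):

```python
# Hypothetical model of the four member-node tiers described in the talk.
from enum import IntEnum

class Tier(IntEnum):
    PUBLIC_READ    = 1  # tier 1: read-only, public content
    ACCESS_CONTROL = 2  # tier 2: read-only, with access control
    READ_WRITE     = 3  # tier 3: clients may also deposit/write data
    REPLICATION    = 4  # tier 4: also replicates other member nodes' data

def can(node_tier: Tier, operation: str) -> bool:
    """Each tier includes the capabilities of every tier below it."""
    required = {
        "read":         Tier.PUBLIC_READ,
        "authenticate": Tier.ACCESS_CONTROL,
        "write":        Tier.READ_WRITE,
        "replicate":    Tier.REPLICATION,
    }[operation]
    return node_tier >= required

print(can(Tier.READ_WRITE, "write"))      # True
print(can(Tier.READ_WRITE, "replicate"))  # False -- only tier 4 nodes replicate
```

Tiering like this lets a small institutional repository join at tier 1 with almost no implementation burden, while large archives can opt into the full replication role.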
This just shows the growth of the KNB repository. Initially there were a couple of years of development and early adopters, and then we started an extensive education campaign. Then a number of large research networks, like LTER and PISCO, which is a coastal ocean research network, all signed on, and since then the repository has been growing steadily to, again, a fairly large size. So how does all this work for a member node or a data repository? This shows the ORNL DAAC, and, you know, they've existed like this for close to two decades. They've had their data collectors out collecting data and contributing it to the repository, and they've had a user community that has taken advantage of those data for other types of analyses and synthesis efforts and so on. But one of the things that was a real attraction for the ORNL DAAC was the ability to connect up with DataONE. In this case, any of the ORNL DAAC users then have access to data available through the other repositories that are part of the DataONE system, as well as the tools that are part of the investigator toolkit. And equally important is the fact that the users in the bottom here, from other different programs and so on, can more easily discover the data that are available in the ORNL DAAC. So this is a real benefit to NASA, and it ends up being a win-win for everyone concerned. This is the group of institutions that we're working with internationally right now to add in member nodes over the next couple of years. We've hit most every continent, and we're actively adding on lots of member nodes with some fairly significant content worldwide. One of the other key components is what we call our investigator toolkit, and again, this was the idea of providing tools that scientists are comfortable using on a daily basis, so they can integrate those more closely with the data resources that they're collecting as well. So through our users group we've prioritized a number of
tools to add into the toolkit. I'll just go around the circle here real quick: Microsoft Excel and R; some metadata management systems that are currently in use in different communities; discovery tools like Mendeley, Zotero, and ONEMercury, which I'll show in just a minute; integration tools, as well as some semantic tools that we're incorporating into our project; and then for analysis, Kepler, a workflow engine, and R again; VisTrails, the visualization tool that I showed on the bird project; and then MATLAB as well, another key tool that is widely used in the communities that we've dealt with. One of our partner institutions and a member of the DataONE network recently released the DMPTool, which falls under planning as part of that data life cycle. And this has just taken off like wildfire, probably exceeding everyone's hopes and maybe even their desires, but it's an incredibly useful tool. I actually just used this yesterday to create a data management plan for a proposal I submitted to NSF, and it worked like a charm. This is a sort of wizard-driven tool; it steps you through the process of creating a very credible data management plan for any of the different programs or directorates at NSF, as well as several other agencies that are now being added in as well.
ONEMercury is our tool that supports data discovery. It starts out (I don't show the first slide here) with an essentially Google-like interface where you can type in one or two keywords. This is the next page that pops up, which allows you to draw bounding boxes, identify specific locations, put in temporal windows, and so on, and hone in on the specific data you're particularly looking for. You can then go even further and, through faceted search, apply different filters like author and keywords; others, like sensors, can easily be added should that information be available. Then there's a relevance factor, which you see here as a number of stars. Depending on the keywords and all the search criteria you put in, you get a listing of all the data that may be relevant to your project, along with a relevance factor that lets you look more closely. You can view the metadata, and if you see something that's highly relevant to what you're doing (you may not be able to see this, but there's another link here called "find similar data"), this is where we're incorporating a lot of our semantic tools to ease the search and make it much more powerful. Once you've discovered the data items you like (you'll see a box up there where we've checked off three data packages), you can incorporate those into your own library, which you see at the bottom here. These are all data packages you've collected via your search through the ONEMercury tool, so you can very easily build up your own collection of data products that you can then take and work with; you can take that next step in terms of meta-analysis and synthesis, or wherever your research project leads you.
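The mechanics just described, keyword relevance plus facet filters narrowing a result list, can be sketched in a few lines. The record fields and the scoring rule here are hypothetical stand-ins, not ONEMercury's actual index or ranking.

```python
# Sketch of faceted search with a crude relevance score, in the spirit of
# ONEMercury. Records, fields, and scoring are invented for illustration.

RECORDS = [
    {"title": "Bird migration counts", "author": "Smith", "keywords": {"birds", "migration"}},
    {"title": "Ocean wave heights",    "author": "Jones", "keywords": {"ocean", "waves"}},
    {"title": "Bird song archive",     "author": "Smith", "keywords": {"birds", "acoustics"}},
]

def search(query_terms, facets=None):
    """Rank records by keyword overlap, then narrow by exact facet matches."""
    facets = facets or {}
    hits = []
    for rec in RECORDS:
        score = len(set(query_terms) & rec["keywords"])  # crude relevance (star count)
        if score and all(rec.get(k) == v for k, v in facets.items()):
            hits.append((score, rec["title"]))
    # Highest score first; titles break ties.
    return [title for score, title in sorted(hits, reverse=True)]

print(search({"birds"}, facets={"author": "Smith"}))
```

A production search engine does this with an inverted index and tf-idf-style ranking rather than set intersection, but the user-visible behavior (a ranked list that facets progressively narrow) is the same.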
One other element we're working on, which will be in the spring release, is what we call ONEDrive. It will initially be read-only, but it allows you to mount the entire set of DataONE resources as a drive on your own machine, so you can very quickly search through providers, keywords, projects, titles, a whole array of different search parameters, and do that on your local system. Education and outreach is another key component. One of the things we have started is a graduate program: a very intensive three-week course that we teach for environmental scientists, focusing on informatics, advanced analysis and visualization, and geospatial analyses. It's three weeks of very intensive hands-on activities, and the students go away with a solid understanding of the tools as well as the ability to actually put them into practice. That's something I think is clearly missing from a lot of our curricula in the environmental sciences these days: the ability to use those tools. A couple of the things we're doing to support the community are the databases shown at the top, what we call best practices. This would also help anyone who is filling out the data management planning tool and has a question like "what documentation standard should I use?" You can go to the best practices database, enter "documentation" or "metadata," and it will pull up a one-pager that walks you through the topic and hopefully answers your question, along with pointers to tutorials and additional resources like books, key journal articles, and so on.
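The "mount as a drive" idea mentioned above amounts to projecting catalog metadata onto a directory hierarchy, so that browsing folders is really filtering by provider, project, and so on. Here is a toy read-only projection; the grouping keys and catalog entries are invented, and a real implementation would sit behind a filesystem layer such as FUSE rather than a Python function.

```python
# Toy read-only projection of a data catalog onto a directory-like tree,
# in the spirit of ONEDrive. Catalog contents and grouping are hypothetical.

CATALOG = [
    {"provider": "KNB",       "project": "LTER",    "title": "soil_moisture.csv"},
    {"provider": "ORNL-DAAC", "project": "FLUXNET", "title": "carbon_flux.csv"},
    {"provider": "KNB",       "project": "PISCO",   "title": "kelp_survey.csv"},
]

def listdir(path=()):
    """List the 'directory' entries at a path like ('KNB',) or ('KNB', 'LTER')."""
    keys = ("provider", "project", "title")
    depth = len(path)
    # Keep catalog entries whose fields match every path component so far,
    # then expose the next field's distinct values as directory entries.
    subset = [e for e in CATALOG if all(e[keys[i]] == path[i] for i in range(depth))]
    return sorted({e[keys[depth]] for e in subset})

print(listdir())           # top level: providers
print(listdir(("KNB",)))   # projects under one provider
```

Because every listing is computed from the catalog on demand, the "drive" is naturally read-only and always reflects the current holdings, which matches the initial read-only release described in the talk.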
I just mentioned the DMP Tool, and at the bottom here we're working on what we call learning modules. These are modules that you can take and incorporate into lectures in your general biology, ecology, environmental sciences, or earth sciences courses, whatever they may be. The idea is to really get informatics education into the domain sciences, where it's needed, and I think that is going to receive a lot of use over the next few years. We also do a lot of outreach to different groups (let's see, I'm not sure why that animation is not working). Anyway, we do a lot of outreach to societies, where we put on training conferences, and we have built some cartoons; they don't always work well, but when they do they're actually quite compelling, and they step you through different aspects of the process, such as how to upload data to DataONE and so forth. So any day now we'll be doing our full-blown public release. We've actually had a website up for several months with some of the resources, particularly the education resources, but the full cyberinfrastructure will be released this month, and the website will look a lot like this. We've had a lot of support from NSF and USGS, and in-kind support from NASA, the Gordon and Betty Moore Foundation, and Microsoft Research as well.
So this is what we're in the process of releasing right now; our developers are cleaning up the last couple of little bugs and getting things ready for an imminent release. We have a number of tools in the toolkit that will be part of the first release. The client libraries are all there to support the developer community, so if you're on a project and you have developers, you can add your own tools to the toolkit, and you can also develop the interfaces needed to add yourself as a member node to the DataONE federation. The member node software is right now up and running for seven different member nodes, and the coordinating node services are all in place. So we have our public release in December, and after that we're on about a ten-week development cycle for new products and services that will be released over this next year. We'll be adding things like ONEDrive in February; VisTrails, that visualization tool, will be available in April; and we have a whole array of other tools and services we'll be releasing over the next several months.
As part of the DataNet requirements we follow a fairly tight project management schedule, with project milestones as you can see here. We're basically hitting all of our targets and have exceeded some, one of which is the amount of storage available through DataONE, which has already far exceeded our expectations for the first several years. I present this as a way of closing on this particular topic, but I've thought recently about how we judge success, not only in DataONE but across DataNet and probably across lots of other cyberinfrastructure projects represented in this room. I've looked at how NSF views success and how various projects are judged to be a success or not, and I would argue that in many cases the metrics for success are off target. So I started really thinking about this, and I want to propose these as criteria for success that we could all use in judging how well our cyberinfrastructure and information technology development projects are working. First: how quickly can a scientist discover and acquire relevant data, wherever they may reside? I think that's one of the key challenges on which we need to make significant progress in the next few years. Second: how much time is spent on mundane data management activities? As I indicated, you could argue the split is 80/20, 70/30, 90/10, whatever, but far too much time right now is spent on fairly mundane activities. Can we switch that around so that scientists spend 80 or 90 percent of their time doing the real advanced analysis, visualization, and interpretation, which is, I think, where they get more intellectual stimulation? Third: how fast can data be visualized, analyzed, interpreted, and published? This is all part of the daily life of a scientist; they live for publications and analysis and interpretation and getting
science done, so what are we doing to enable that and speed it up? Fourth: can analyses and interpretations be readily reproduced by others, and are they transparent? I think this is getting more and more play, particularly through things like Climategate and other issues we see in the news, but it's a key part of science that we seem to have forgotten over the last 40 or 50 years: reproducibility is a key element of what science is all about. Fifth: can scientists readily discover and use the tools they need? In going around the country and talking with various groups, I'll mention a package like VisTrails, or the Kepler scientific workflow system, or others that I know could really speed up the process of science, and a large portion of the scientists I talk with aren't aware that these tools even exist. So we have a ways to go in informing the community about tools that will really speed up their work and make science proceed much faster. Sixth: how rapidly can a community mobilize to tackle a grand challenge question? That's not only through the cyberinfrastructure support I mentioned previously, but also through supporting the collaboration that's necessary, bringing interdisciplinary or multidisciplinary scientists together, virtually in many cases, to support that work. And finally: do scientists feel they are being properly rewarded for efforts devoted to data management and collaboration? This is a real challenge as we focus on grand challenge science questions that can no longer be tackled by an individual scientist and a graduate student in the laboratory, but instead require collaboration with maybe three or four, or two hundred, other scientists. How do they get credit and feel rewarded for investing effort in those types of problems? So I want to conclude with just a couple of comments. Again, I argue that we're in an era
of grand challenge science, grand challenge scholarship, grand challenge humanities, whatever; I think it's all quite related. Science is becoming much more of a team sport; we're tackling bigger, more challenging, more complex problems. Second, science is embracing the fourth paradigm, and we'll see more and more of that over the next few years. This is true in the humanities and social sciences as well, where sensors are now playing a key role in understanding human populations, decision making, and so on. Third, data have value. Fourth, libraries are looking, acting, and behaving very differently, and will continue to do so even more over the next few years. And finally, data management, I would argue, is the new statistics. So how do we usher in this new era I've described? There are three activities I think everyone in this room can perform. First is to promote it. We need to get out there and really sell the fact that these grand challenge questions require a new way of doing science, so we need to embrace interdisciplinary, transdisciplinary, collaborative, and data-intensive science. We can do that through a whole variety of means, but one I think is really key is bringing awareness to these issues. One area that deserves really active thought is collaboration: working together as teams and projects. That's something academia has been reluctant to do; as far as I can remember, there have been very few efforts that actually focus on how you bring, say, a group of undergraduates or graduate students together to work as a team on a problem and share expertise, and that's the direction I think we're heading. Second is education. We probably have dozens of different library and information schools represented in this room, but is that solving the problem? I would argue no, it's not. We need to really inculcate education into the domain
sciences as well. It's really key not only to have graduate programs in library and information science, for example, but also to get informatics into basic biology courses, basic anthropology courses, all of the different domains and disciplines. Third is advocacy, and I think we need to advocate on several fronts. One is for new funding: the article I showed by the head of science for the Department of Energy is, I think, indicative of the challenges we face in garnering public support for basic and applied science research, and we need to clearly advocate for that. We need to work with the citizenry and get out at every opportunity. How many of you have spoken before a Rotary Club, or a school, or some other nontraditional group? Anyone? I see a few hands. That's how you have impact: getting out and spreading the word to the unconverted. Here I feel like I'm preaching to the choir, and I'm not sure how useful that is. We also need to advocate for breaking down a lot of the barriers to this new way of doing science, and that includes academic barriers, one of which is the tenure and promotion system that rewards stovepiping and focusing on your own single small area of research, with your own first-author publications, as opposed to working on teams. Just recently I had a discussion with an individual who was weighing whether someone would be promoted in an academic department at my university, and he was really concerned because this particular scientist had chosen to publish his work in open-access journals like PLoS, which, you know, is clearly a horrible place to publish. Secondly, they had chosen to do all their research and publish it in an open notebook. Again, there was this view that openness and sharing, and doing things through the open literature, had a negative connotation, which I was really shocked to hear. But we need to break
that perception down. Funding silos, I think, are also key, and this applies to foundations of all ilks. Many of the challenges we face are much broader than any one program at NSF, for example, or at the Mellon Foundation; they cross different floors within NSF, they cross different programs supported by different private foundations, and so on. We need to break down those barriers and make it possible to support research that will confront some of those grand challenge questions I showed previously. I just want to acknowledge my team here. These are the people who have essentially been with me for close to four years now, since we first wrote our initial pre-proposal, and they've stayed with us through weekly phone calls; it's been a real joy to work with each and every one of them. This represents about a quarter, or less, of our workforce; a lot of our other work is done through working groups, and there are several people in the audience who are part of those working groups whom I don't mention, but we appreciate the working group efforts as well. I'm available to be contacted at any time if you have questions, and I think we probably have time for a couple of questions now as well. I see at least a couple of mics over here, which you might want to use, because my hearing is not the best.

Thank you, Bill. How is DataONE dealing with the long-term sustainability-of-data problem?

Let's see. The problem is a long-term problem, and I don't think we have to worry about sustaining it; I think what you meant was how we are dealing with the long-term sustainability of DataONE, and that's a tougher question to answer. We'll be talking about this more in January, but we've been looking at a number of different approaches, some of which involve potentially membership, some of which might involve fee-for-service type activities. Another is that foundations and
so on, including NSF, are probably going to realize that they have a role as key stakeholders in supporting data, not only in repositories but also in infrastructure like DataONE that makes the data more accessible and supports replication and other activities. So I think we're going to see somewhat of a change in mindset as we all, as part of the DataNet and INTEROP projects and others, step through the process you have to consider in building a business plan: who are the stakeholders, and who are the beneficiaries? Environmental scientists in this case are beneficiaries; the public is a beneficiary; funding agencies benefit from having their data more easily exposed, readily exchanged, and so on. So I would argue that we have to look at who benefits and what the value is to them, and then present a case, via proposals, membership fees, whatever, to the appropriate parties, and have an open discussion with them about what they think their role is in this. That's where we're going in DataONE, and we're actively engaged in this dialogue with NSF, USGS, and others right now. So I can't answer specifically, but we have a lot of ideas that we're working on as part of our business planning.

Uh-oh, I'm getting ready to get nailed here by Mackenzie.

This is an easy one. You mentioned data integration as one of the mundane data management tasks that researchers hate, right? Integrating data from lots of data sets from different sources. And I think in your grand challenge example there were many tens, if not hundreds, of data sources. In the toolkit you mentioned one tool that kind of fits in that space, other than just R: some semantic toolkit. But in my experience this is a really hard problem, right? So is this something DataONE is really seriously tackling? How do you see it fitting into your strategy, and where do you think that activity should
sit in the future, given how hard it is to do, how much domain expertise you need, and dot dot dot?

Right, that's a great question, and we are working on it hard, but we're doing it with partners. I mentioned, and I should have shown my org chart, that we have ten working groups as part of DataONE, all associated with different problems: citizen science, security issues, and so forth. One of the working groups is focused on semantics, and we have about a dozen individuals associated with it. It's led by Deborah McGuinness and Jeff Horsburgh, and we have Carl Lagoze and several other members of that working group. What they bring to the table is their integration with other NSF-supported projects focused on developing ontologies and semantic mediation tools, and our goal this year is to release some semantic tools that will be added into the coordinating nodes, the search engine in particular, to help with the data discovery part. That doesn't get to integration, which is a more difficult challenge. We're working with a couple of other projects, one in particular called SONet, led by Mark Schildhauer out of the University of California and several others, which is focused on developing an observation ontology. This is an area where I think we can have some impact in the next couple of years to support integration. Most of the types of data in the environmental sciences are based on some kind of an observation, through either a sensor or a human observer, and those observations have characteristics like spatial location; maybe they used some kind of a plot, a square-meter plot or a square-foot plot or whatever. One of the things you want to be able to do, where you've got observations that are similar in nature, is easily do those conversions: take a square-meter plot and convert that to
a square-foot plot or whatever, and that's just a very simplistic example of what we're dealing with. The real challenges that are going to drive this whole area of research for the next decade are in the individual domain ontologies, and that's where you end up with problems like wave height. If you deal with an ornithologist who was involved with the eBird project, for example, wave height means something very different to them, in terms of bird songs, than it does to an oceanographer, who views wave height as something else altogether. At that scale we end up with problems across all of these different domains, where people use the same terminology for very different things, and that's where we have to have that context built into the semantic mediation as well. So that's a much more challenging issue that I think will keep a lot of us busy for the next decade or so. Do we see any more questions? Thank you so much, and I think Cliff is going to close us out.

Thanks so much for that, Bill. I do hope that you will stay in touch on this; you've got a tremendously ambitious agenda, and I know I speak for many of us in hoping that we can follow your progress in coming years as you work on these things. I think we are at the end of the program; it remains for me simply to thank you for coming, to wish you safe travels home, a good holiday season, and everything for a successful 2012. Thank you all for coming.