 I'm Keith Webster, Dean of Libraries, delighted to welcome you to this latest event in our three weeks of Open Access events. Open Access Week gives us an important annual opportunity to remind ourselves of the importance of our work in this area, to share experiences and advice, and to make plans for the future. I've argued for many years that Open Access, which really focuses on open access to scholarly literature, is simply a step on the trajectory towards open science and open scholarship. In many ways, Open Access and its work in the journal space, we've pretty much sorted out. Open Access has overcome many of its initial cultural hurdles, institutionally and through government funders. We've sorted out a large part of the policy aspects, and once we sort out the funding element, we can all get over that, tick the Open Access box and move on. I think the really important and exciting challenge now is turning to open data. The big difference between Open Access and Open Data, in my simplistic world view, is that Open Access to publications really was taking an established business model and changing the financial flows, perhaps teasing out some of the profit from the commercial publishers, but we kind of knew what we were doing. We understood how publishing worked. With data sharing, it's a very different space. We don't have that 300-year culture of sharing data in the way that we envisage through the Open Data movement, and I think that is something we certainly need to tackle. Open Data really represents a tremendous potential for the research community globally. If we can share the results and the observations of internationally funded projects, we offer a tremendous resource for the research community into the future, but for that we have a number of social, cultural and technological challenges that we need to resolve. Now on a campus like CMU you could argue that the technology is easy to do, but it can't be done on its own.
We need to recognise that there are social and cultural aspects as well that deserve our attention, and we are tremendously fortunate to have organisations like the Research Data Alliance working on our behalf to tackle many of these aspects. We're delighted to have with us today Fran Berman from RPI, who is the US lead for the Research Data Alliance. That's a tremendously important role, but it is one aspect of a tremendously significant career in academic computer science and supercomputing. Fran has served as director of the San Diego Supercomputer Center. She was on the faculty at UCSD, vice president of research at RPI, distinguished professor of computer science at RPI, as well as US lead of the RDA, fellow of every relevant body you could think of and a winner of numerous national and international awards. So we really are thrilled to have Fran with us over the next couple of days. Today's talk is on data curation, data sharing in a broad sense, I think. Tomorrow at 11 am Fran is delivering a presentation in Mellon Institute, more specifically on the work of the RDA, and you would of course be very welcome at that. That was advertised for librarians, but we are a broad church and happy to have anyone come and join the party. We have another important event next week as the final part of our three-week Open Access celebrations. There is a flyer on the table at the back where you can get more information on that. If you would like to submit a poster, please do so, and we have t-shirt awards as well, which always seems to be a drawcard, and there may be pizza at that event. Fran, welcome. I will hand you the microphone and invite you to say a few words.
There are no t-shirts and pizza at this event, but I will try to make it entertaining nonetheless. If t-shirts and pizza would have helped, you'll have to tell Keith yourself.
First of all I want to thank you. It's always a thrill to come to CMU. I visited with some great people in SCS earlier today, and now I get to visit with you, and tomorrow I get to visit with the library community, so it's always a treat for me. I wanted to talk about the big picture today, and I wanted to talk about building a sustainable ecosystem for data, because data pretty much drives the world that we're in. I'm going to try to stay on this side because I'm shorter than the average bear. This is a slide that you often see, I think, in a lot of people's talks. Data is just driving everything. If you're thinking about public health, who's most at risk to contract asthma, that's a data-driven question. If you're thinking about agricultural productivity, these days people are thinking about it in terms of how do I couple air quality data and germplasm data and terrestrial data and site data to try to answer that question. Energy: when the Higgs boson was discovered at the Large Hadron Collider, that was a big deal. Data is everywhere. Data is driving innovation and discovery the world around. Instead of going off and looking at agricultural productivity or energy and how data is driving it, let's just look at the data itself. I think what we all know is that just having data available is not good enough to drive this. Data is really not an asset if you don't know what it means: no one has curated it, you don't have appropriate metadata. Data is not an asset if you can't find it either: you don't have appropriate discovery. Where is the data you need? Data is not an asset if you can't analyze it. Is it in the right form for analysis? If I want to reproduce my results, somebody needs to have kept the data around. It turns out that in order for me to get to that really great place where I can use data for innovation, I need infrastructure. Now, where is the data?
If you think about it, for me to be able to use my data from my paper and from the past, I need to have been thinking about stewardship and preservation. This is a picture of a movie, which is data, that has been decomposing. It turns out that about 50% of the movies from before 1950 have really decomposed. Today, many of our movies, if not most of our movies, are digital. I had the great good fortune this summer to be at the Academy of Motion Picture Arts and Sciences, the Oscars place. They have, as you all will recognize, digital archives for those movies. Somebody's got to pay the bill for that, somebody's got to ready it, somebody's got to use it. Data matters. Now, it turns out that for data to be sustainable, the infrastructure on which we use data has to be sustainable. Without that sustainable infrastructure, data has no home, and the homeless data will cease to exist. It means that the public access that we're all now being required to provide for various kinds of government-sponsored research data, the use and reuse of data, reproducibility of results, data management plans, et cetera: we can't do any of that without data infrastructure. Let's talk a little bit about that. When we talk about a sustainable digital environment, what should we reference? What does that mean for us? One thing we can do is go to the real world, and there's been a lot of thought about sustainable environments, especially at the United Nations and other kinds of organizations. The United Nations Brundtland Commission defines sustainable development as development that meets the needs of the present without compromising the ability of future generations to meet their own needs. That's exactly what we want in a sustainable digital environment. It turns out that the very interesting analysis that these folks did identified four kinds of factors that you need to have a sustainable environment: ecological factors (you can think of that as infrastructure for us), economic factors, culture, and politics.
What I want to do today is talk about the key questions in each of these areas that we need to think about in order to think about sustainable digital environments. Here we go. I want to think about four different challenges. For ecology, I want to think about how we can accelerate the development of that data infrastructure that we need. For economics, who is paying the data bill? For culture, I want to think about how we determine what stewardship and infrastructure is needed most, and where it is needed most. Then I want to get a little bit out there at the very end of the talk and talk about the brave new world we're all going to, which is the internet of things, and how important governance and politics and social structure are for that. Let's start with the first one. What kind of infrastructure do we need? Say I want to answer any important research question that I have. What will happen in an earthquake? Which energy sources are renewable? How do we increase agricultural productivity? Who's at risk for asthma, et cetera? It turns out that in order to answer those questions, I need lots of infrastructure building blocks. What do they look like? I might need to adopt a data sharing policy within my organization. NIH has done this and said that if you're doing a certain kind of research, you've got to deposit the results of your research in the Protein Data Bank, or the Alzheimer's data collection, or the autism data collection. That means that policy changes behavior and helps create infrastructure. I might need data analytics algorithms. Lots of you and your colleagues may be working on algorithms that help me understand the meaning of the data that I'm talking about. Data citation standards: that helps me make sure that people know about the data I've cited in various publications. All of these and more are building blocks that support the kind of research I need to do.
Note that these building blocks are of two different types. They're of technical and social infrastructure. We often think about infrastructure as basically the hardware and the software. But we tend not to think about the policy and the practice and the community agreement and the economics as part of the infrastructure. But that's important too. So here's infrastructure from our real life. Turns out that policy makes a huge difference. Policy changes the way people behave. If I say no smoking, that changes our behavior. If NIH says deposit your protein structures in the Protein Data Bank, that makes a difference in terms of how communities behave. That infrastructure is important. Systems interoperability. So it turns out we have very few systems where we do the same exact things the world around. And a really good example is power and plugs. And what's the outcome? Do I really care about plugs? Actually I don't really care about plugs. But what I care about is I want to go to a hotel room in every city around the world and be able to read my email. So what do I do if I want to do that? I look up, the last place I was that was outside of the US was France. And I look up on the internet which plug adapter do I need. Next place I'll be will be Japan. That will be a different plug adapter. That doesn't mean that I need the community to get together and figure out the shape, size, color, same power the world around. And then wait for everybody to kind of adopt this sort of universal approach. It means I can read my email between now and then. And really all that matters is not the infrastructure. It's not the plumbing I care about. It's what I do with the plumbing that I care about. And so interoperability between all kinds of systems that we use is incredibly important. To get the people who sort of think about length and inches to agree with the other people, the other community that thinks about length and centimeters, you can do that exercise. 
But why not just have a piece of code that helps you translate from one thing to another? So interoperability is important. Common standards and metadata. In the research community and the library community a lot of people sort of think that's really important. But it is critically important. When I go to Home Depot, I don't just get the kind of lumber that they feel like cutting on any given day. My community, if I'm a construction person, has agreed on shapes and sizes of bolts and lumber and all kinds of things. Steve's laughing at me because I know nothing about what I'm talking about because I don't actually build things. But the fact is that communities agree on standards. And what those standards help them do is take components, put them together and build bigger things. And that's how we use them as well. When communities come together, when the astronomy community comes together and has very certain standards for how it's going to look at its sky surveys, or the physics community comes together and has certain metadata that will be applied to various kinds of data collections, that enables them to really do much more with their data. Similarly, sustainable economics. If your infrastructure is not sustained in any kind of reasonable way, it may not be there. I don't want to be waiting for the data bus and have the data bus not come because we ran out of money. We're going to talk about that in a very important practice. Training, education and workforce. It's great that we have computers and phones and all kinds of technology. If we don't know how to turn them on or how to use them, they're not useful to us. With all kinds of technologies, it's going to be absolutely critical that we make sure there's a low barrier to access and an understanding of how to use them. Finally, I want to talk about community practice. My favorite example for this is driving, because we all sign up, take the test, and learn the motor vehicle code.
What happens is this does not give us an example of every single situation we're going to see. It gives us good guidelines: you get to an intersection, you stop at a stop light, who goes first, stuff like that. For a lot of the things we encounter, we have to combine policy and practice and make sense out of it. That happens in the data world too. All of these things are critically important for the data world. We just have our own data versions of them. Social infrastructure is at least as important as technical infrastructure. It turns out that it's not just us. It's not that the US is worried about these things and other people don't worry about them. It turns out that people all over the globe are worried about these data infrastructure issues. You see a lot of really interesting variation. If you look around the world, Europe is particularly interesting. There are digital rights agendas and all kinds of things that they're doing. Australia has the Australian National Data Service. All around the world people are really coming up with different approaches, and sometimes similar approaches, to solving the same kinds of infrastructure problems. If you think about the original question, which is how can we accelerate the development of infrastructure, take all these things and put them together. You need some building blocks. Those building blocks help you with innovation. It's very important for those building blocks to be social and technical. By the way, we have colleagues all over the world looking at these same problems. All of those things brought us to thinking about the Research Data Alliance. This is a new organization, new meaning about three years old. It was launched in March 2013. It went from eight of us who were on the phone in August 2012 to now a community of over 3,200 people from over 100 different countries and a really broad spectrum of domains.
The focus of the organization is to build very pragmatic pieces of infrastructure that people can utilize in their research environments to then solve the research problems that require data sharing and exchange. Very simple. It's not doing decadal surveys. It's not solving everybody's problems. It's taking this very, very specific approach. What is the approach? The idea is to really solve problems and make progress. The members of the RDA come together in two forms, working groups and interest groups. Interest groups are longer-lived discussion forums where people are talking about what kinds of infrastructure would be useful to solve these problems: agricultural productivity, harmonization of marine data, et cetera. The working groups are groups that spin off, typically out of interest groups, for about 12 to 18 months. They go ahead and build it. There need to be adopters in the working groups. They build a piece of infrastructure. It solves the problem. It solves somebody's problem, not necessarily everybody's problem, and they move ahead. It's a very pragmatic culture and it's really kind of fun in that sense. It's not a build-it-and-they-will-come culture. Everybody's got to have some adopters in the group. Otherwise, they don't get to be a working group, so that's part of the vetting process of the organization. We would love to have universal Esperanto infrastructures that solve problems like my plug problem, but that's not what the RDA does. It builds very specific things, solving very specific problems. If you're a physicist and you have a particular interoperability problem, that's great. We solve something. If the agricultural productivity people can use that infrastructure as well, perhaps customize it or adapt it, terrific. Otherwise, if they need to build something different, they do. We promote technology neutrality. This is not a platform for everybody to say, my stuff is the best stuff.
The idea is really not to do any kind of world domination, but to work with lots of different data organizations out there to accomplish their goals as well as the RDA's goals. I thought I would give you a little bit of just a spin on what the groups are doing. So you have 3200 people. It's community-driven. No one tells anybody what to do. You bring your problems to RDA. RDA is a useful vehicle for many problems, and then the idea is you do it. If you kind of think about a space going from data provider to data consumer, and that's the beneficiary of the infrastructure, and the solution being kind of somewhere between technical and social, you have a number of different things, and the stuff that's in italics are sort of key words that we tend to see in the groups that do it. So social and organizational, they're worried about education, training, practices, ethics, things like that. Technical solution aimed at the data provider, they're looking at sort of the guts of the engine of data infrastructure. So they want to do publication, analytics, infrastructure management. This kind of gives you a little bit of a sense. So I think I started at the upper right. So on social organization where you're kind of looking at practices and literacy and stuff like that, whoops, wrong thing, where you're looking at social and the data provider, this is kind of a sampling of some of the groups that the RDA community created to think about those kinds of issues. So for example, we have folks who are looking at data costs, folks that are looking at certification of repositories, folks that are looking at legal interoperability for data, et cetera. If you're looking at now more on the data consumer side, there's folks who are coming up with curriculum for data science summer schools in Africa, other people who are looking at library practices for research data, et cetera. So all of these groups were developed by the RDA community. 
If you're looking at the technical spaces that are focused on the consumer, we typically have a lot of domain groups that are now forming, looking at biomedical data or marine data or biodiversity or digital ethnography: anybody who kind of needs research infrastructure to support the data activities of the community. And then under the hood we have a lot of stuff around data type registries and a data fabric, different kinds of deliverables of RDA, et cetera. So you kind of see among the interest groups just a broad spectrum of different kinds of problems people want to solve, people who have come to the RDA because they have a global environment in which to talk with a lot of other people. They have actually a multi-discipline environment, so at any RDA meeting you'll see groups with librarians, information scientists, domain scientists, policy makers, et cetera. You get a really interesting mix of people. And then it's intergenerational as well. You see people who are grad students and early in their careers, and people who are much more senior. Okay, I've just done an eye chart just to kind of give you, you know, a sense of what the working groups are doing. That was the interest groups, and the working groups are the very first beginning of a pipeline of deliverables. There were eight of them. And a lot of them were in the technical space for the data provider. So you'll notice that we had a metadata standards directory group and a data type registries group and a data description registry group. And then we got our first kind of domain-focused one. So the wheat data interoperability group were really worried about this issue of: I have germplasm data, I have terrestrial data, I have air quality data, and I want those all to interoperate so I can look at questions about wheat. And by the way, if I can look at questions about wheat, then I can look at questions about maize or rice or other kinds of crops. So this is kind of infrastructure.
You want to move that infrastructure to other communities as appropriate. So the other thing I wanted to tell you about RDA is that one of the best things about RDA has turned out to be its ability to kind of help build the data community. So our plenaries are working meetings. You go to the RDA meetings and it's not a lot of talking heads. We have a number of people who might talk in, say, a keynote or something like that. But our members want to spend most of the time working together in various interest groups and working groups, and then they want to have joint meetings. So the digital ethnography people last time had a number of different joint meetings because they wanted to borrow what the metadata people were doing and what the library people were doing and stuff like that for their work so they could solve their problems. The plenaries are always fun and exciting, but they also draw a bunch of other groups who are very interested in the RDA community. So we've had biomedical people and storage people and EarthCube and Data Seal of Approval, all kinds of people, co-locate their meetings with RDA, with a nice cross-fertilization. So the community aspect has been really good, and that community is important for building infrastructure. So I just wanted to give you one more slide. I will spend the entire talk tomorrow on the RDA, and if you're interested in joining, it's free. It's fun. Sign up. Believe in world peace and you're in. And that will give you a lot more information about the community. Three things, if you ask what the goals are for RDA. More infrastructure: the idea is, let's have some infrastructure out there so you can solve your research problems. That's what we want to do. Effective community: let's facilitate the connections you need to make with colleagues so you can solve your problems. And then let's keep coordinating and create more synergy among the infrastructure we develop. So those are the goals of RDA. So let's go to the next thing.
So we talked about infrastructure. Infrastructure is really important. We need to be building infrastructure so we can solve the innovation problems that we have. Who's going to pay for that infrastructure? It turns out that somebody needs to pay for that infrastructure. It's not free. So you might say, why do I need to pay for infrastructure? I do my research. I put my data on my hard drive. I give it to whoever wants it. Good to go. That's free for me. Well, it turns out that if you can do that, you are right there in the center and it's pretty cheap for you. You have locally manageable data. You yourself can manage that data. But it turns out that if you're trying to do anything that requires more effort, it gets more expensive for you. If you're an astrophysicist, you're a big data person, so you do these big supercomputer simulations. You get 200, 300 terabytes of data. This is not something that's fitting on your hard drive. So what you need to do is figure out somewhere to host your data and keep your data for as long as you need it. That's typically three to five years, till you do the next big set of runs. If your data needs to be longer lived, you want to keep it for two years, five years, ten years, 20 years, 100 years. Some of our data is required to be kept a long time. Data like the census data or the Panel Study of Income Dynamics, which has been around for decades: that's data we want to keep for a long time. You have to migrate it forward into new media. You have to make sure that it's duplicated or replicated so you don't lose it. Your data may need access control, either really, really broad access (the Protein Data Bank needs to be available to people all over the world 24-7) or really restricted access (you have HIPAA data, and that means not everybody can see that whenever they want to). Those access control infrastructure systems are expensive.
If you need more curation, if you need coupled data services, if you need more stewardship or management, if you have big data, all of that costs more. It turns out that the more infrastructure you have beyond your hard drive, the more cost you have. You may say, what are we paying for? Storage is getting really cheap. What's the problem? It turns out that to do a really responsible job of it, you have to pay for a lot of different things. The graph on the right is from when I was director of the San Diego Supercomputer Center and we put together a data archive for the community. We kept 100 collections for a few years. It was called Data Central. This is our storage growth. In the end we had several petabytes worth of data, which was about 100 data collections. We made several copies of each of those data collections, and the red line is the storage we had to keep buying. We wanted, of course, to stay ahead of the curve. Our community is kind of that blue line, fit to the purple line, but the idea was that as the research collections increased, we had to stay ahead of the game and we had to make sure that everything was okay. What systems were in place for that data? We paid for a lot of things. We paid for space. We paid for networking. We paid for backup systems. We paid for monitoring and auditing systems. We paid for maintenance and upkeep. All of these things, compliance with regulation and policy and all of those things, turn out to cost money. When you're paying for data infrastructure, not only is it not free, but you have to kind of get it right. If you think about it, this is your data's home. You want all the systems to be working. Somebody had better pay the water bill and the electricity bill and the taxes and the insurance and all of those kinds of things, and the same thing is important for your data.
You think, we know we need data infrastructure for innovation, and we know that we need somebody to pay for the infrastructure. What's the problem? Why is it such a hard sell? It turns out that it is really hard to get people to pay the appropriate level of attention to paying the data bill, especially in our academic environment. It's interesting to tease apart the question: what's research and what's infrastructure? What's newsworthy in research land? We've just heard recently about a bunch of different Nobel prizes. Really exciting, a lot of attention. When the Higgs boson was discovered, Science said it was the discovery of the year. It's sort of all the moonshot stuff. That's really, really exciting. This is good news. All we hear about infrastructure is bad news. Someone lost your social security number. The electronic medical records are not working. You can't sign up for your healthcare. Your Target cards, all the information got stolen. What you hear about is the bad news. There's this kind of good news, bad news. The fact that the lights stayed on in this room is not newsworthy to anyone, but it's infrastructure, and without lights and power there is no talk. What's the value proposition? Research has a much better value proposition. Discovery and innovation mean competitive advantage. It means leadership. We're ahead of the game. We're number one. Infrastructure has a much weaker value proposition: it's an enabler of innovation, but not so easy to sell. Of course, the funding model for research, none of us think it's enough, but it's manageable. I have fixed-term funding. I have a project, I write a grant, I ask someone for money and they give it to me for the term of the project. Infrastructure needs continuous long-term support. If you look at it, we understand why infrastructure is such a hard sell. We really need infrastructure, but this is like your water mains.
No one's paying attention to your water mains until they freeze and break. Then you have to think about where you get money for new water mains. In light of all this, or perhaps in spite of how hard it is, we're now getting an increased reason to have appropriate stewardship and preservation infrastructure. In February of 2013, the White House sent out a memo through OSTP saying not only do we want to make the publications from federally sponsored research available, but we want your data to be available too. Of course, as the New York Times pithily said, we paid for the research, so let's see it. The fact is that when that 2013 memo came out, it tasked the agencies to come up with plans, it gave them six months, et cetera, but it said there's no new money for this. As all of us in the research community know, every dollar is really precious to us. We don't want to spend it on infrastructure; we think that more of it should be available for research. However, we need more infrastructure to get that data available. In response to that, Vint Cerf and I wrote an op-ed for Science, and what we wanted to do was get it out in time for agencies to start factoring in realistic plans for paying for appropriate amounts of stewardship and preservation infrastructure in their plans that were due in September. This is "Who Will Pay for Public Access." It came out in August of 2013. We got a bunch of somewhat entertaining and sometimes colorful tweets because you actually had to pay to read this article in Science. If you want to read the free copy, it is on my website, just saying, and you're welcome to read it. But we did get some good colorful tweets, which I did not include in this talk. So what was our solution? Our solution was, look, no sector is responsible for paying for the data infrastructure all by themselves. That's completely unrealistic.
But the fact is, it's a shared expense, and if we start pushing forward in every sector, we could actually start catching some of the data on which our future innovation depends. This is really important stuff. So in the academic sector, libraries are the place to be. It turns out that university libraries, institutional repositories, and domain repositories have incredible expertise, and exactly the expertise you want. They're used to providing things as a public asset. They understand laws. They understand policy. They understand curation. They understand stewardship. University libraries and domain repositories are really important players in all of this. Now, most of them are not going to get up to speed as data hosts for their university without some bump in funding. So, you know, how do you start up that kind of a thing? And then how do you develop a sustainable business model around the libraries who want to be doing that? And what I notice is that there's a number of really progressive libraries that are starting to try to develop sustainable business models. But our suggestion was that, for a little bump in a program in the National Science Foundation or other kinds of funding agencies, you could really help libraries start up and then develop sustainable business plans that help them provide that research data for the community. For the public sector, the government does host a lot of our data, but it doesn't host all of our data. And the question is, you know, what's the difference between data of mine that the government is willing to host and data of mine that the government is not willing to host? And the fact is there's not a lot of clarity there. You know that NIH has been very progressive about, you know, proteins and Alzheimer's and autism. NIST hosts a lot of data. NSF hosts some data. But you don't really know whether you're in the government-supported bucket or the not-government-supported bucket.
So some clarity about what you would need to do to go from not-government-supported to government-supported, and what that path looks like, would be very, very helpful. Is it a number of citations? Is it a number of bytes? Is it a number of people in the community that need it, et cetera? And so all four of these directions are directions that I and many other people are working on. In the private sector, all of us go to, you know, symphonies and ballets and plays all the time. We look at the back sheet. This was, you know, sponsored by Google or General Electric or, you know, the Sloan Foundation or, you know, someone. And it turns out that there is a lot of philanthropy around things that we think are of public value. Why not our data? Why can't we start looking at data philanthropy? Could it be the Fran Berman endowed data collection on, you know, pick your favorite topic? And this is something we can do, because a lot of the time companies get tax incentives, or incentives of various sorts, to do these kinds of things. A lot of our companies have extra capacity where they can host our data. And it's worthwhile to sort of figure out the contractual pieces. What's private? What's not? Can they stick their hand in your data pocket while you're trying to look at that? We need to really get that right. Those are public-private partnerships that we can do more about. And then maybe the most controversial thing that we suggested was that individuals in the research community really could pay for data. You know, I pay $1.99 for a copy of a Lady Gaga song, so why aren't I willing to pay $1.99 to get a hit from the Protein Data Bank? And, you know, what's wrong with sort of doing things at a really low barrier to access that acknowledges and helps support infrastructure? So all of these things, we think, are a shared solution, and really important for paying the data bill. 
And I wanted to end this section of the talk with something that Cliff Lynch, who's an incredible leader and head of the Coalition for Networked Information, had said. And it's a little bit long. But what Cliff is saying is, look, when you make the case, you can't just say data infrastructure is important to me so it should be important to you. Having been a vice president for research, I'll tell you that will get you nowhere. You know, there's like 100 people every day at your office saying, you know, give me a bucket of money, and mostly you have a thimble of money, but you're not telling them that. But what you have to understand is why we need infrastructure, and it turns out that we need infrastructure for innovation. And why is innovation important? It's leadership. It's competitive advantage. It's our future in so many different ways. We can make that case. Infrastructure is a prerequisite for it. And so in some sense, you know, if we want that data to be available and to be used and reused, a precondition for that is infrastructure. And a precondition for infrastructure is that somebody's got to pay the data bill. So we need to be smart about how we make the case for this. And that goes for all of us, definitely including me. So, okay, let me go to the next thing, and these will be a little shorter. They're not as fully baked as the first part of the talk. But I thought that you might be interested in seeing some new ideas that are sort of work in progress. And one of the things that I've been worried about is, you know, okay, so say we can invest. Say we make that case and we can invest. How do we determine where stewardship and infrastructure are needed the most? And so one of the things you can do is you can think about what the stewardship gap is. 
And so imagine all of the data we're making publicly accessible, and how much of it is pretty sustainable, you know, in ICPSR or, you know, in the Protein Data Bank or in various other vehicles, and how much of it is at risk. And it turns out that nobody really knows the answer in any kind of a broad-based way. But there was recently a really great paper that Jerry Sheehan from the National Library of Medicine and a bunch of other people at NIH did, trying to estimate what the stewardship gap was for data that was used in publications for 2011. And so what they did is they looked at PubMed Central. The methodology is really interesting, so I recommend reading the paper. They look at the publications, and those publications refer to data. For all of that data, they look at where the data is. And what they found is that about 12% of those data sets were in something they recognize as, say, a sustainable repository. And 88%, almost 90%, of that data was what they call invisible data. They don't know where it is. It could be in a good repository that they just don't know about. It could not be. It could be on your hard drive, whatever. They estimated that that's probably 200,000-plus data sets. And then what they did is they looked at those invisible data sets, which are arguably at risk, and at how much of that reflects new use and reuse. So it turns out that most of them were kind of new use, but some of them were reuse of other data sets that people had. And more interestingly, about half of those data sets were derived from human or animal subjects. So if you think about reproducibility, you would imagine that it's harder to reproduce things with specific human or animal populations. So it turns out that, you know, they did a really great job of trying to make some estimation, and that there's a stewardship gap there, but if we look across the kinds of academic activities that we do, we have no idea. So Myron Gutmann and I are trying to, like, figure this out. 
And we developed something called the Stewardship Gap Project. It's in its pilot year, so we won't have answers on everything. And the goals are to develop the kind of methodology we need to really estimate what is at risk in the stewardship gap, what's there of value, and what the policy and financial implications are. So that's the most important thing, because what we want to do is give stakeholders, funding agencies, people who can pay the data bill, the ammunition they need to really go ahead and say, hey, this is a problem. Here are the pressure points. Let's invest here. So that's really important to us. And then we roped in a bunch of people whose work we really respect. So John Gantz has done this kind of methodology for those great Digital Universe reports that many of us have read. Andy Maltz and Elizabeth Cohen have worked on The Digital Dilemma, which talks about the digital stewardship problem for movies. We have Sayeed from the Johns Hopkins Library, and Guha and Vint from Google, who have their own spin on this. We're really trying to get people who have a huge depth of experience and a sort of strategic perspective on this. And the idea is then to sort of come up with a survey so that we're going to be able to start answering some of these questions and start looking at, you know, what the stewardship gap is and what that looks like, and then to work with people like Phil Bourne at NIH and other places to figure out what questions you have and what answers you need in order to make the case around data stewardship and preservation. So let me just tell you that a sidelight of what we're doing is that we want to understand what's valuable. And it turns out this is, like, tremendously interesting and intriguing. So, you know, what's valuable? You can think about valuable as things you want to keep over the long term, and here are sort of three ideas. 
You know, as a society we've determined that what is valuable is kind of official data, historically valuable data. You could think about, you know, Obama's emails getting hosted at the National Archives, and there's the Shoah Foundation's Holocaust survivor testimony. This is stuff that's of societal value. If it goes away, there's no way to get it back: census data, etc. In our research communities there's lots of data that's very valuable to us, on which we depend. So, the Sloan Digital Sky Survey, the Arabidopsis resource that had lots of funding problems for a while, the Protein Data Bank, etc. These are data collections on which a big part of the community relies. And then for me personally there's lots of data that I think is really valuable. My kids' graduation pictures; those are actually my kids, not all four of them, two of them. My tax records and stuff like that. So, now it turns out that I'm taking responsibility for my data and the government's taking responsibility for the government data, but we have a lot of issues about the data collections in the middle. Who's taking responsibility for things that are of community interest, and what's valuable? So, just some interesting work in progress: when we talked to our planning and advisory committee about what data is valuable, the answer was that value is in the eye of the beholder, but you can start getting at something more concrete by really looking at different kinds of value. So, data that's in demand by researchers is more likely to be valuable than data that's not. Data that we're required to keep is more likely to be considered valuable. Data that we have to preserve as part of good scholarly practice, the data that's associated with our publications, that's valuable. Data for which value improves over time. The Panel Study of Income Dynamics has been around for a long time; the longer it's around, the more valuable it is for us. And data that I can't replicate easily, that's likely to be more valuable. 
So, we're trying to kind of understand what that is, and whether more value leads to more or better stewardship. What is the connection between stewardship and value? Hopefully we will know more about it, so stay tuned. I want to end with actually three slides on this, but this is something I've been thinking about a lot that I think is a real brave new world, which is the Internet of Things. For those of you that know the Internet of Things, you'll notice that it's, like, everywhere, and for those of you that don't think a lot about the Internet of Things, we are hooking up everything. We're hooking up the sensors in our toaster and the sensors in our car. We've got self-driving cars. We've got wireless cows. You know, we've got all kinds of crazy things these days. What happens when all of that becomes our environment? So, it turns out that we're now in what you can think of as the stone age of the Internet of Things. And, you know, we're starting to look at what it means to have self-driving cars. You know, does that freak you out or does that make you happy? Does the self-driving car drive better than your aging parents, or worse? Could the self-driving car drive your kid to school if you, you know, have a meeting? You know, wireless cows. People are starting to look at, you know, where are the cows and, you know, how can I do herd management and all kinds of, you know, precision agriculture and things like that? Of course, there's HAL 9000. I couldn't help but put HAL 9000 from 2001: A Space Odyssey on. But, you know, in some sense that was like an early precursor to the Internet of Things. And, you know, when we think about this, we have kind of two ways of thinking about it: you know, either it's an incredibly wonderful enabling environment or we're going to have Lord of the Flies out there. So, you know, how should we manage or organize that? Who's going to develop the laws for the Internet of Things? Who's going to enforce the laws? 
Can you opt out of the Internet of Things? And it turns out that when everything has surveillance and all of your stuff is, you know, connected to each other, opting out may not even be an option. So what happens when you can't? And you can kind of think about all these questions that, you know, we have. Who's accountable when my self-driving car hits someone? You know, if I could have intervened, am I accountable? If I couldn't have intervened, is it the algorithm designer that gets sent to jail? You know, what decisions can be made by technology? You know, I might like it if my refrigerator talks to my grocery store and says, Fran needs more milk. If my grocery store talks to my insurance agency and says, God, all Fran does is buy wine, you know, does that mean I'm going to have a larger insurance bill because they think that I'm more at risk for alcoholism? You know, so there are good and bad sides of this. Does your privacy matter more than the needs of others? You know, maybe I don't want you to know that I was just someplace where I could have caught Ebola. But TSA wants to know that kind of thing. So, you know, when does my privacy matter more than everybody's privacy? And, you know, does your computer know good from evil? Does my computer know the difference between, you know, an email from Steve and a denial of service attack? And, you know, what are the ethics of my computer? So, when I start thinking about the Internet of Things, you know, I think it's appropriate for us to think about what governance means. Now, if you think about it, when the Internet was started so many years ago, if we had known that privacy and security and the things that are big issues now were going to be big issues, we might have framed it a little bit differently. But we are at that place, at the stone age of the Internet of Things. So, what should we think about? 
Well, go back to the UN, who spent a long time thinking about stuff like this, and there is something called the World Governance Index, which is based on the Millennium Declaration and its key governance themes. So, you know, when we think about governance, you know, we are putting mechanisms in place for peace and security and, you know, rule of law, human rights, sustainable development, human development, etc. Well, what will that mean in the Internet of Things? Well, you know, in peace and security, what it means is, you know, trust, safety, crime; that is going to be really important for us. You know, we are starting to see things like Stuxnet and the various kinds of attacks that can be made on physical and other kinds of systems that are tremendously important. So, we are going to need laws around that. Democracy and the rule of law: what is a legal framework for determining appropriate behavior on the Internet of Things? So, if my self-driving car does hit someone, am I responsible? Is the maker of the car responsible? Is the algorithm designer or developer responsible? Is the company responsible? So, you know, those are questions we are going to have about a lot of things. If you think about rights and participation, are we going to have an IoT bill of rights? You know, rights you might imagine: maybe you have a right to privacy and a right to control your own information, a right to opt out. But what will these actually mean and how will we enforce them? In particular, what does equality mean on the Internet of Things? Does that mean that my toaster and I are equal to one another? How do we penalize discrimination? So, those are issues. Sustainable development will need architectures and standards and policy and infrastructure to promote the kinds of growth we want. 
And then you think about human development. You know, we have artificial intelligence; will we have artificial ethics to go with it? And whose ethics should my machine have? Because there is a notion of how these things get used and what's permissible and what's not permissible. And it turns out that it's going to be really hard for us to figure it out on something that is so comprehensive and has so much potential. So, all of these are things that, for the most part, no one has a clue as to how we're going to do, but it's really important to start thinking about them now. So, what's future work? There are a lot of people who are starting to come together. There's been a lot of work in Europe around the Internet of Things, some really interesting reports, some reports from the US and other kinds of things. But you know, you kind of think that getting the governance right is sort of Internet of Things 101, maybe. You know, what's advanced Internet of Things? Well, is it a society? You know, if you think about it, what's the ethnography of the Internet of Things? What's that going to look like? You know, who are the citizens? You know, what's the ethical code? When we think about, you know, governance structures, we think about a common good. What is the common good on the Internet of Things? What are the ethics on it? And when we start sort of developing all this, you know, do we have voting procedures? Who gets to vote? You know, if, I don't know, Google has a vote and Fran has a vote and my toaster has a vote and my car has a vote and CMU has a vote, you know, what does that mean? So, you know, I have no answers for these, but I've been thinking about this a lot and I encourage you to join me in thinking about this a lot, because I think it's going to be tremendously important for us to kind of get this right. 
And I think it will make the difference between this being an incredibly great future and a really, really scary future. So let me just go back to the original question I asked. I've taken to sort of ending my talks with small concrete steps, not the big vision, because what I'd like you to do, if you're willing, is to leave this talk and, like, go do something that's good for the world and data. And so this should be something we can accomplish in a small way in our own lives on Monday morning. So here are my suggestions. Ecology. If you're a researcher, or you like researchers, or you have friends who are researchers, you know, create and implement a data management and stewardship plan for your project, for your data. What is happening to your data? How are you going to make sure that it's safe, it's out there, it's discoverable, etc.? Put your data in a repository. Make it safe. You know, try to figure out what that is and what the data bill is going to be, and in fact make that data bill a priority when you think about budgeting for your project. What are you going to do for your data now? What are you going to do for your data after the time that your data is no longer paid for by the grant? On culture, you know, all of us can do things that contribute to data sharing. You can cite and publish your data. My friends from RPI have this great idea that if I go to my regular old conference, there's a data session. And I have a publication in that and I put my data in that, and that's a way for me to get a publication for curating and taking care of my data. And then I think we all have to make the case to stakeholders. I think we want to adopt and support the right policy and practice, but I think we need to make the case beyond "this is important to us." This is important to everyone. And I think if we do it, we really have a bright future ahead of us. So with that I want to thank you, and I really appreciate you coming to the talk. Time for a few questions. Steve has a question. 
My question is this: you made a case convincingly, and many others have made it over and over again, that data and computation are enablers of advancement in almost every academic discipline, and that this continues to spread into almost every discipline we can think of. So I think we need to make the case for every discipline we can think of, except the highly theoretical ones. So I wonder if there's some sort of way that we could make a big flash of light across the university communities and Congress and policy makers, that data deserves the same amount of attention that was given almost instantaneously to supercomputing when they began to talk about scientific visualization. Scientific visualization drove the funding of the supercomputing community for the critical period in the early years when it could or could not have taken hold. So I'm wondering if the same thing might be possible, to raise the level of visibility and get people to recognize the importance of data management and curation.
That's a great question, and I'm one of the people in the intersection of those two worlds, so it's fun to try to answer that. 
You know, at least in the US, and you were there too, when the supercomputer center program started there was the blue book, and they listed a whole bunch of things, you know, turbulence and... Yeah, they were presented as pictures too. But in some sense the argument, to simplify, was made that unless we have supercomputers we can't predict the weather well, or we can't understand how airplanes operate, or a bunch of other things. And so, you know, if you get this tool then you'll be able to sort of address this question. I think we have a harder time in the data world because we don't have that same kind of argument: if you build this data infrastructure and you amass this data, you know, what then? We can identify sort of data-driven problems, but I think, you know, there is some data infrastructure out there, things are sort of scattered, and we don't have the same cohesive set of arguments that we did in the supercomputer environment. So if you look at that contrast, I think it's like the slide I gave on why infrastructure is such a hard sell. I think it's just really difficult to sort of elevate it to the right level. Now, that being said, when we used to talk about this, you and I and many other people in this room, in the '90s and the 2000s, you know, people would sort of yawn: whatever, you know, tell us something interesting. And, you know, now data has reached a national priority, not just here in the US but all over the world, and big data has helped us. We are surfing that wave, whether your data is big data or not. But, you know, I think without that, you know, real driver and priority, it's really hard to get the infrastructure you need. And then of course just pointing to the lights, you know, knowing that the lights worked or that the water worked, you know, that isn't sufficient to get people excited oftentimes. And people have many, many, many pressing priorities, so you sort of do the thing that'll solve the biggest problem or get you the most 
you know, creds first, and oftentimes data is not in that space.
Do you think there's anything that can be done locally, within a given university, to demonstrate or to support the notion that there is a place or a group of people to go to, to bring data scientists into their work, whatever it may be?
Well, I do think that the libraries are, you know, in a lot of universities they are really the hidden advocates for a modern world where research and education and all of the things that we do in the university are really supported in a very important way. You know, if you think about it, you know, things have changed a lot, and these days if you ask a student, you know, where they go for information, it's, you know, a coffee shop, and, you know, they look up Wikipedia on their iPhone. And so, you know, unlike many, many years ago, you don't need to be in a library, you don't need to be in a lot of other places, but the libraries have a really important role as we go into the future. And so every time I see, you know, somebody like the late Ann Wolpert, or Brian Schottlaender, who I worked with very closely at UC San Diego, or, you know, Jim Mullins and Michael Witt, you know, who are doing great things at Purdue, or Sayeed at Johns Hopkins, and look at sort of how they are going forward and really trying to get data to be an integral part of what libraries do, I am thrilled, because I really do think that libraries are a big part of the answer. And, you know, in making that case to your university administration, having been one of those folks, you've got to make the case that a new data design is getting ahead of everybody else as well as something that is really going to help your researchers, but I think it can be done. I think a new data design covered that you are starting up and it really shows some nice things that you can do you can do both text like this.
Time for one last question.
Being interested in your comment, a lot of what you are talking about... Is there anything to go grab the junior researchers, the people that are just starting to create data, and educate them in ways that, then, 10 or 20 years from now, they'll just fit right into a system that's curating the data?
Well, I don't know about the fitting right in 10 or 20 years from now. I know our experience in the Research Data Alliance has been really terrific. We've really focused on intergenerationality, because if nothing else, the RDA is a great opportunity for people early on in their careers to expand their horizons, to meet colleagues from all over the world, to see different ways of doing the kinds of things they want to do, to get down and actually build some infrastructure. What we found is that not only is that a benefit for them, but they are a benefit for people who are farther along in their careers, because they are natives in this world. I think about it like this: my parents didn't grow up with television, but I've had television ever since I was born. I didn't grow up with the internet, but my kids had the internet, so they're natives. I have the internet, but I'm not a native of the internet age. And I think for the new folks, the early-career folks that we have with us in the Research Data Alliance, whether they're computer scientists or whether they're biophysicists, data is a really important part of both their professional life and their real life. In real life they're using this stuff all the time. It's not quite as bad as "wait for the rest of us to die out," but I really think they're doing new things. I look at the digital ethnography and the digital humanities groups, and all those new folks coming along are using data as a matter of course. And of course on the NSF side of things, I'm co-chair of the NSF's advisory committee, and one of the things we're looking at for NSF is, where is data science happening? Where is it going to happen in 15 years? Is it in statistics departments? 
Will it be a special major? Is it going to be part of everything else, like the computational-X thing that happened? But as more and more people become data literate and understand more about data science, I think it's going to come up in all this stuff. I think that's going to be a great thing, because we're living in a data-driven world.
I was at NSF for a while, and I've read lots of data management plans. There's a lot of education needed. They are bad. One, is it NSF felt that they could require people to put it into, to maintain repositories because it goes through your grant and you can only require it to educate the data. But then also just seeing how many some people are data business. Because you see them all, not just the ones that get funded.
No, I'm so with you. Because they're, like, bad. As VPR, you look at this and it's like, what do you mean? But on the other hand, when NSF did that, it started the conversation. It elevated it as a priority. Okay, now we've got to get it right. And whether NSF has decided how much data it could actually store, or who watches the store, or who it's going to point to, all of those are really important, super important implementation details. But at least we're talking about it. And so, you know, it's my favorite unfunded mandate. So, you know, I'm happy to see it. But now we've got to figure out how it actually works. But, you know, it's interesting to hear your perspective.
So Fran, again, thank you. There's more opportunity to converse with you at the Mellon Institute. Thank you. Thank you guys.