Welcome. My name is Serge Goldstein. I'm the Director of Academic Computing at Princeton University, and I'm going to be talking to you today about storing research data forever, even though Cliff has just told us that nobody's worrying about storing research data forever. So I guess I'm one of the few people who is worrying about that. Actually, what I'm going to try to do is convince you... I'm sorry, what do I need to do? Can you hear me now? Okay. I'm going to try to convince you that there's actually very little difference, from a funding cost perspective, between storing data for a long time and storing it forever.

Okay, so why do we have to store, or why do we want to store, research data forever? Well, one of the reasons is because we have to. Cliff talked about the fact that a number of funding agencies are now requiring that grants include some kind of what is generically being called a data management plan. The NIH was one of the first agencies to do this. This is a pointer to their data sharing policy, which was published back in 2003: all investigator-initiated applications with direct costs greater than $500,000 in any single year will be expected to address data sharing in their application. And let me make an important point here: data management plans are not just about archiving or storing data. They are about sharing data, about disseminating data. It isn't enough simply to keep the bits. You have to make the bits available to other folks.

I'll skip to the one that has been in the press and that we're here to hear about and talk about: the NSF data sharing policy, which takes effect January 18, 2011. Proposals submitted to NSF must include a supplementary document of no more than two pages labeled "Data Management Plan." This supplementary document should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results. By the way, these slides and presentations I think are going to be available on the CNI site, so you don't have to copy furiously, or if you're interested, I can email the presentation to you as well.

A point about the data management plan: it does not say that you have to preserve and disseminate data. It says you have to tell the reviewers what you plan to do with your data. Now, you could say, "We plan to throw our data on the floor at the end of our research," and that is a perfectly valid data management plan. Whether the reviewers will fund such a research project is another question, but we do have to make that point. The data management plan simply says you have to be explicit about what's going to happen with the data that's generated as a result of your research. If you're interested in other agencies' policies, there's a really great page by Gary King; here's the URL. And the National Academy of Sciences has published a report called Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age. So all of these URLs will be available online.

Okay, a couple of quick points before we get to the core of what I'm going to talk about, which is how do you fund this. I just want to note that when we say storing research data, we're not talking necessarily just about numbers. We're talking also potentially about images, video, audio. If you're an astronomer, what you may be preserving is a JPEG image of a star. We can be talking about computer code if your research involves simulation. And of course it can involve text in various formats.
Now here comes the really interesting question: what do we mean by forever? Cliff was saying that the NSF is really expecting people to store data for a few years. My feeling is, I don't know what a few years is. It could be three. It could be five. It could be seven. So I'm going to try to come up with a method that's going to store the data, what I would say is, indefinitely, in the same way that libraries store information today. You can't go to a library and ask them, how long are you going to store books? They'll tell you, as long as we can. And I think that's what we're talking about here: indefinite, long-term storage of research data. By the way, in most cases this is longer than the duration of a grant. You're going to have to keep your data and make it available after the grant is done, in particular after the grant funding is done. And that's the hard part. What do you do? How do you pay for storage of research data once your grant has expired? Of course, there are other reasons why we want to store research data: because we want to, because we need to, to encourage honesty, and so on. But let's be frank here. We're storing research data because the granting agency is telling us we have to. Still, there are some other perfectly valid and reasonable reasons why we should be doing this, regardless of what the granting agencies say.

Now, what are some of the current models out there for doing this? One of people's favorite models is to let someone else do it. And one of the someone elses out there that can be storing your data for you is government agencies or labs or bureaus. All of you, I'm sure, are familiar with these: there are a number of large-scale, government-backed efforts to create databases of research data. I've listed a few here. This is probably out of date by now; there are probably many others that I haven't mentioned. But that's one of the places where you can store research data. And if your data fits into one of these repositories, this is a great place to put it, largely because you typically don't have to worry about funding, and you typically don't have to worry about dissemination. All of that is taken care of for you. So if your researchers are producing data that fits into one of these national data banks, that's where you want it to go. Unfortunately, these data banks don't exist for all fields. They exist for very few.

Some other possible places where you may want to store your data: there are a number of professional societies and journals which run data banks that will store research data. I've mentioned a few. Dryad is a particularly interesting one that stores ecology data. Another place to put it, of course, is at another university. There are sometimes some nice people at another university who will happily accept your research data and store it for you. Here are some examples. Some of the nice folks at other universities partner with the government, so a number of the government-run efforts are actually partnerships between universities and government agencies. And finally, there's the cloud, which is emerging as another potential place where you might want to store research data. What I'm going to be talking about here is storing the research data yourself, at your institution or in a consortium of universities, because it turns out that for many, many, many research fields, there simply aren't any nice people out there who will store your data, nor are there any government agencies that will do it.
The funding for all of these different models is, to be frank, somewhat haphazard. I've had numerous conversations with people who are storing research data and asked, what's your funding model? What's your business model? And they say, well, the provost is paying, or we got a grant, or we have some money and we think we can pay for it, or whatever. In many, many cases, even with the government-run efforts, there doesn't seem to be any firmly established plan for long-term funding of these repositories. Furthermore, many of these repositories require some form of ongoing payment: you have to pay X dollars per month to have your research data stored. Now, the advantage to that kind of model is that it's a very capitalist approach to storing research data. The repository will store it as long as you keep paying for it, and presumably people will pay for it if it's important data. So this is a nice model not only for storing research data but for throwing it away: it gets thrown away when no one's willing to pay for it anymore. I don't think that's a great model, to be frank, because there's a lot of data out there that may turn out to be very valuable 10, 20, 30 years from now. We're discovering from efforts like JSTOR and so on that all those journals that got preserved and are now digitized and online have a wealth of really valuable information that's being mined today simply because it was preserved. So if the capitalist approach isn't the one we want to uniformly adopt, my question is: how can we pay for this storage in a reasonable way? Not necessarily paying every month, because paying something every month means that there has to be somebody willing to fund it and to keep paying for the research data.

Our approach at Princeton has been to develop what we're calling the pay once, store endlessly model. The acronym is POSE, or POSI; it's not great. I wish this phrase had turned out to be easily pronounceable, but if you can come up with something better, please let me know after the talk, because it's very important sometimes to have a good acronym for things, or people don't remember it and it doesn't go anywhere. Why should people pay once to have their data stored forever? Because grants expire, often quickly, and grant funding is not indefinite. So if I have X amount of dollars and I produce research data, it'd be nice if I could put into my grant that it's going to cost X amount of money to archive and store my data, but that X has to be a fixed sum. It can't be an ongoing monthly sum. Grants expire, and researchers expire too. Even if you get your researcher or department to agree to pay some kind of monthly fee, the reality is that over time those researchers go away, the departmental administrators go away, the department chair goes away, and eventually someone's going to come along who's going to say, why are we paying a thousand dollars a month for this data set?

How can we store data forever? Well, one of the reasons we can store data forever at our institutions is because, unlike researchers, our administrators expire very slowly. Their average longevity is much greater than that of any researcher. They're going to stick around, the administration's going to stick around, and our institutions don't expire all that often. Princeton's been around for 250, 260, 270 years, something like that, and it's likely to be around for another 200. It's a great place to store data long term. So how do we do it? How can we store your data forever but charge you only once? The answer is math.
The magic of math. Now I'm going to do some math, so you all have to sort of take a deep breath. It's not going to be very difficult math; it's basically high school algebra. Our basic model is as follows. We ask: how much is it going to cost me to store your data forever? And initially we just look at the electronics and disk drives. How much is it going to cost to store your data? Well, we have an initial cost, and the initial cost is how much it's going to cost me today to buy the disk drives. Now, one of the things about disk drives, and storage in general, is that the cost decreases very rapidly over time. In fact, the cost of disk drives decreases at about 20 percent per year. Do you know how much it cost 20 years ago to buy a 10-megabyte drive? About a thousand dollars. Do you know what a terabyte drive costs today? About five hundred dollars. Do the math. It's a very, very steep decline. In fact, 20 percent is conservative. Disk drive costs, storage costs in general, decline dramatically over time.

So we take our initial cost and we factor in that decline. We ask how often, in years, we are going to replace the storage: we're going to put it on disk initially, three years from now we're going to have to buy new disks, and three years after that we're going to have to buy new disks again. Our total cost to store the data forever is then given by that equation: the total cost is the initial cost, plus what it'll cost to replace those disks in, say, three years, plus what it'll cost to replace them again in six years, and so on and so forth. And if you add all that up, guess what? It converges. That infinite sum is finite. It does not go to infinity. In fact, if you use a replacement cycle of four years and assume that the decline in disk cost is about 20 percent per year, the total comes out to be, my drum roll is not working, I'm sorry, about twice the initial cost. So if a terabyte disk drive today costs $500 and you give me a thousand dollars, I'm going to be able to store that data on disk forever, replacing those disk drives every three years with the next generation of disk drives. Now, some of you may be thinking this can't be true, but it actually is true. The math actually works. This is what it really costs.

And the other interesting thing about the equation is that not only does it cost only about twice the initial cost to replace the disk drives forever, but after about 10 to 15 years the costs are essentially negligible. What I'm saying to you, essentially, is that today some of you may worry if a faculty member comes in and asks for a terabyte of storage; that may sound like a lot. Fifteen, twenty years from now, a terabyte of storage will be nothing, next to nothing. It'll be a blip. You'll be storing hundreds, thousands, millions of petabytes of data at your research institutions. This is an example of what we're calling DataSpace at Princeton. DataSpace is the system we have implemented to provide this long-term storage to our faculty. This chart shows some real disk drive costs over a number of years at Princeton University, shows the decline in those costs, and shows a computation of what we're telling our faculty we need to charge them to store their data forever.
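To spell out the sum being described, here is a minimal sketch, with assumed symbols that are not from the slides themselves: an initial hardware cost \(C_0\), an annual price decline \(d\), and a replacement cycle of \(r\) years.

$$
C_{\text{total}} \;=\; C_0 + C_0(1-d)^{r} + C_0(1-d)^{2r} + \cdots \;=\; \sum_{k=0}^{\infty} C_0\,(1-d)^{kr} \;=\; \frac{C_0}{1-(1-d)^{r}}
$$

With \(d = 0.20\), a three-year cycle gives a multiplier of \(1/(1-0.8^{3}) \approx 2.05\), and a four-year cycle gives \(\approx 1.69\), so "about twice the initial cost" either way. Note that the series converges only because \(d > 0\); the smaller the assumed decline, the larger the multiplier, which is where the later discussion about disk price trends comes in.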
And as you can see, we're offering to store the data right now, forever, at about three dollars per gigabyte, or, if we add in tape backup, about five dollars per gigabyte. Forever, not for a year: forever. Or in other words, about five thousand dollars to store a terabyte of their data forever. Now, if you think about it, this makes a certain amount of sense as a funding and charging model, because things like the NSF data management and data sharing policy specifically state that you can put the storage costs into your grant as direct costs. So this is not an unfunded mandate. NSF is saying, give us a data management plan, and by the way, put it in your budget. So the faculty member is going to come to you and say, how much should I put in my budget? And we can turn around and say, how much storage are you going to need? They estimate the number of gigabytes or terabytes or whatever, we do the multiplication, and we tell them, that's the number you need to put into your grant.

Now, sometimes people will hear about this and will tell me, that's a nice model, but disk drives only account for a small part of the cost of storing data. What about people? What about the staff and so on? And typically they point to a whole bunch of studies that have been published showing that the disks themselves account for only 5% of the cost of storage, with people and overhead accounting for the rest. The problem with those studies is that they take each data set, or each chunk of storage, and assign to it the entire cost of the staff. But in fact, if you look at the marginal costs, if someone comes to you today and says, can you store a gigabyte of my data, you are not going to have to hire a new storage administrator. The people costs also scale the way disk drives do. The two members of your staff who 20 years ago were managing a gigabyte of data are today managing hundreds of terabytes of data. So the per-gigabyte people costs also decline very, very rapidly, and the model can account for them just as well as it can account for disk drives.

I'm going to end now and just very quickly say that this pay once, store forever model at Princeton has been tied to an operational and organizational model called write once, read forever. What we want to do with this model is keep the staffing costs and the overhead costs to a minimum. So for data that goes into this repository, which by the way is using the DSpace software from MIT, we have a very, very minimalist approach to data curation and management. We store the bits, we store them forever, and we make them available over the web, but that's all we do. If you want us to translate the bits into some other format, if you want us to deliver them on some special kind of device, you have to pay for that separately. This is a very minimalist keep-the-data, make-it-available, we're-done approach. And if you'd like to know more, there's the URL; it's in our DSpace repository, a paper that explains in greater detail the model and how it works. And thanks very much.

Okay, so Serge gave you, as he put it, a presentation that has a lot of math in it. I hope mine's less anxiety-ridden. It has no math, but it has a sort of social and political emphasis to it, and that can give some people just as much anxiety, I guess. One of the things that I want to talk about is that Purdue's approach to this has been a long time coming: we look at this as one opportunity within a long series of opportunities to work with people on campus.
So is it the data management plan in and of itself, or is the idea that you're planning a service for something much bigger and longer-term? I'm just going to give a couple of slides here to give a little bit of context and background, to show you where we're coming from. Basically, this is a collaboration on campus between the Libraries, the Office of the Vice President for Research, our IT at Purdue group, and the faculty. And it's based on a couple of projects: one out of the Distributed Data Curation Center, and another out of the NCN's nanoHUB, the HUBzero software platform.

Just as a little bit of background: the Libraries have been working with faculty in what we call an interdisciplinary research initiative since 2004 to identify what kind of needs they had in the area of data. That's not that long ago, though in internet years that's, you know, 28 years ago, but it was the same time that the DCC in the UK was founded. So it goes back quite a ways in terms of the relationships that we were building with faculty to understand their problems with data. And so in 2006 we founded a center that was really meant to leverage research that we're doing with other faculty in their projects. The importance of being able to establish these relationships helps us move into this new area that we're working in. This is a quick example of some of the projects, funded by NSF, USDA, seed grants, et cetera, in which we have been partners with faculty at Purdue. And this is what I like to call the money graph: over the past five years we have been co-PIs and PIs with faculty on proposals. And this helps us see what kind of issues they have, what kind of problems they have, what kind of worries they have in that sociopolitical aspect of putting together a plan to start sharing your data when people aren't used to doing that.

One last piece of this is that I got to work behind the scenes in the Office of the Vice President for Research as a fellow, and I did a project which was to analyze some compliance issues and look at whether Purdue would go with a policy, practice, or guidelines, or how they would start to address some aspects of compliance. And I wanted to point out, and Serge kind of got at this too, that faculty have had a lot of pressure on them over the years. Well, maybe not pressure, but they've had compliance requirements that they need to be able to account to. The one that was driven into me is OMB Circular A-110, sections 52 and 53, which say that all research records must remain available for three years after the last financial transaction of a grant. So anybody who's ever had a grant has always had to keep hold of their research records for at least three years, and research records has been clarified to mean research data as well.

So with that kind of perspective, we were able to come into this and form what we called a data management working group. The sociopolitical way of looking at this was to say: if there is no one group on campus that is going to step up and say, here's maybe what we should do, are we just going to let faculty put something together on their own, or is it in the best interest of the university to bring some people together, talk about this, and come up with some ideas?
So this was formed the first of June, co-led by the Dean of Libraries and our CIO, Jerry McCartney, along with some faculty. We had a lot of discussions, and basically the outcomes we came up with were: could we use the data curation profile to identify some areas that needed attention when somebody was going to put together a data management plan; the VPR would identify some proposals going forward for the January deadline; and we would have librarians involved in the process.

Basically what we had was the data curation profile, and I have to plug this; it's in my performance review. The data curation profile is a tool that we developed as part of a research project funded by the IMLS, the Institute of Museum and Library Services, in 2007. From that project we devised this data curation profile, which is meant to be an interview that gets at the specific aspects of one's data set, a specific data collection. You can have inventories that look at campus-wide data; you can have discipline-oriented ones to try to find out what a specific discipline does. This was meant to be a concise, structured, but flexible document for specific faculty data sets. So again, having done this, we were looking at some of those issues that people had been bringing up. And it's also an interesting kind of awareness tool, because when you ask somebody, well, what kind of metadata standards are used, they go, hmm, should I be using metadata standards?

The question then was, can this profile address the mandate? And the line that we are taking is that it's not a direct solution to the data management plan, and it's not a guide to directly curating data, but it is a tool which we think may help facilitate these activities. And again, we're taking this from a research-project approach. The data curation profile was designed to cover the data life cycle: what are the local management and storage practices, the disposition of the data, the dissemination and sharing of the data, and what preservation services or repositories the data was submitted to. We have done fewer than two dozen of these so far. We are currently under another IMLS-funded project to train people across the country to use the profiles, so that more of these can be built up and we can understand more about faculty data through them.

So, Peter showed, sorry, Serge showed the URL for the NSF requirement. The guidelines for submission say to address these five points. And I just want to go through quickly to talk about how these points are not directly addressed by the profile, but how the profile gets at the questions to ask people about their data in these regards. So for instance, when asked how they would describe the data and samples to be produced, these are the kinds of responses people have been able to give. So what we've done is sort of lay the groundwork. Most people, obviously, can describe their data. But when it gets to looking at standards for data and metadata, a lot of people are kind of stuck. As Serge said, you know, does that matter? If a plan is a plan and somebody says, well, this is what I'm going to do with my data, then that's great. But is there something more that could be done? And that's one of the things that the Libraries wanted to see if they could bring to this collaboration. The profile also asks, sorry, the NSF mandate also asks about access and sharing.
And sharing can be: call me on the phone and I'll talk to you a little bit, and if I like what you're asking for, I'll send you a hard drive or something like that. But in the long run, is what the NSF is looking for something where data is available to wide audiences, wide groups of people, different levels of people, so that a lot of people can get access to the data? Provisions for reuse and redistribution: this is something that faculty have thought about but maybe don't know how to go about doing. We had one person we talked to in the pharmaceutical and medical area who said she knows some of their current standards, but she wants to be able to do something even better to make her data more easily shareable. So I think it's about raising awareness of all of these things, and then, finally, plans to archive data and samples.

So what we did at Purdue was to take this profile and create a new instrument, which we don't have a name for yet, and we are testing the instrument on a number of faculty. We have four of them who are submitting proposals for January. We've got a long way to go, right? January 18th, that's a long way away in terms of submitting a proposal. So we have done a first set of interviews. In these interviews, in addition to Jake Carlson, our data research scientist, leading the interview, we had a subject librarian and a grant coordinator. And then we're doing an analysis of a draft of the plan. You may ask, isn't that a heck of a lot of overhead to put into this? Well, first of all, we're approaching it from a research perspective. This is a test to see if this is a process that can work. Can we create an instrument that we could push out to the faculty so that they could do this themselves and produce a first draft that looks at those issues identified in the NSF mandate? So we want to be able to push this out.

What we're doing right now is identifying people who have lately stepped up and said, oh, yeah, is Purdue doing something about this data management plan? They're asking our pre-award people, our sponsored programs people. And what we're doing is providing consulting workshops, which are a combination of the Libraries with the Office of the Vice President for Research. The requirement for attending the consultation workshop is that you have to have the technical section of your proposal done, so you know basically what's going on, and then work through this instrument so you can come to the consulting session with questions. And this is to sort of scale it: rather than doing one-on-ones over and over again, we have a broad workshop and address lots of questions if possible.

The other part of this is that the HUBzero platform at Purdue University is seen as sort of a platform for providing the various services that will go along with a data management plan, and then the underlying storage or repository that goes with that. The Libraries are helping design what's called a curation core. We've worked with the NCN on a couple of other standards-related things, implementing OAI-PMH, for instance, to expose objects on the hubs, and a linked data project to link data between hubs. So we're working with them to identify some of these curation core services and develop a timeline on which we can implement them. And then the Libraries are also developing related resources to be able to point people in the right direction for questions that they have.
And we see this as one of the Libraries' liaison roles, to be able to provide reference to people. So it's not that somebody comes and says, you know, I think there's a new version of XML, can you help me implement it? It would be, let's see if we can find somebody who can locate that, and find a user group or something that could help with that. We do have subject specialists within the library, so a chemistry librarian who was up on that would be able to help a chemistry faculty member, for instance. So I just wanted to leave you with this comic strip. This was sent to me the day the NSF announcement first came out, and I thought it was kind of poignant in that people are like, oh, now we've got to document this thing. So, okay. And then we'll have time for questions. Thank you.

Any questions? Yes. So it's primarily for trying to implement some standards. So for instance, in order to have a preservation service, we want to be able to implement something like PREMIS, or, well, we're still negotiating what would be in there. But some persistence: we are using DataCite in order to provide persistent identifiers for digital objects, and then the citability services that go with that. I don't have the full list in front of me, but we have to moderate this ourselves.

Hi, it's David Rosenthal from Stanford. I first wrote about the issues around endowing data collections about three and a half years ago. I want to take issue with a number of points. Firstly, the costing that you're doing is about 20 times the raw disk cost at the moment. Two-terabyte drives are about 150 bucks. So there's plenty of slack in what you're charging to account for costs other than the storage cost. But there are a lot of costs that you're not taking into account here. The work at the San Diego Supercomputer Center and elsewhere seems to indicate that roughly a third of the cost is media, roughly a third of the cost is staff, and roughly a third of the cost is power, space on the data floor, support, things like that. And what this means is that, since those other costs actually go up over time, the staff costs go down slowly, and historically the disk costs have gone down very rapidly, the overall cost is very sensitive to the continuation of the exponential drop in dollars per byte for disk storage. And what we're hearing from the disk storage industry is that this decrease is likely to stop for the next few years. There are serious technological issues about transitioning to the next generation of disk technology, which is required for four- to eight-terabyte drives. You'll notice that we should have had four-terabyte drives already, but we don't; we only have three-terabyte drives. And also, there are very serious business issues around even the idea of building an eight-terabyte three-and-a-half-inch drive. So the sensitivity of the costing model that you have to the exponential drop continuing in the short term is very strong, and you need to build a model that's robust to the disk drive costs not going down in the early stages of your storage, because at the moment that's what we're hearing from experts like Dave Anderson at Seagate. And also, the other problem is that your model's based around having one copy on disk and one copy on tape. That's what you're charging for at the moment, as I understand it. In the long term, that's not going to deliver reliability. One copy on disk is not going to deliver reliability even in the short term.
So yes, I agree that it's very important to get to a model in which it's possible to endow data storage for the long term. But I don't think you've got that. Thank you.

Okay, comment. Do you want to respond? Should I respond? Okay. That's a lot. First of all, with regard to the cost of disk drives, I wouldn't want to buy a $150 two-terabyte drive. There are $150 two-terabyte drives, but they're not going to be very reliable. So our costs are based on drives that we think are, but let me just finish before you come back and respond again, they're based on what we regard as minimally reliable. You're right, our costs are somewhat high, because they have to build in reasonable disk drives, reasonable electronics. With regard to how rapidly disk drive costs are actually decreasing, one of the nice things about this model is that as long as they decrease at any rate at all, the model converges. And what's interesting is that over time you can adjust the costing parameters. In other words, every year you can look at what's happening with disk drive costs and adjust the parameters appropriately, so some years you may do better and some years you may do worse. As to how rapidly storage costs are actually going to decrease, I think if you limit it to disk drives there may be some near-term problems, but I'm very optimistic about this, and I don't believe prognostications that say disk drive costs won't decrease rapidly; there's every indication that they will. And already we're seeing the replacement of disk drives with various kinds of RAM and other kinds of storage, and I anticipate that in this technology we're going to see similar kinds of developments. The fundamental point is that the cost of the electronics, the disks, and everything associated with storage, including people, if you base it on a per-gigabyte model, is decreasing over time. We can argue about the rate, but it is decreasing, and as long as it is decreasing, that equation is going to converge and you can offer people a model where they pay you once and you essentially store their data forever. But maybe we could talk privately, because I don't want to dominate the whole conversation with just our disagreements about what you should pay for disk drives. And I think that's most of the points you covered, yes. Anybody else?

Yeah, don't get too optimistic about solid-state storage. You can do the math based on the fab capacity that exists in the world and show that there are not going to be enough wafer starts to replace magnetic storage with flash in the next five years or so, because it takes that long to build the fabs, and the fab capacity simply isn't there. And the research at Google and NetApp tends to show that the reliability of drives is not strongly correlated with the cost of the drives. But the point I'm making is that what you're charging this year has to pay for that storage forever, and what you need to charge this year is very strongly correlated with the rate at which you believe your storage media costs are going to decrease in the first few years of the forever, right? So yes, you can adjust it. But the problem is that in adjusting it, when it's three years on and you're adjusting because things haven't gone as you expected, you now have to retrospectively charge the new people extra to pay for your misprediction of the storage costs in the short term.
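To illustrate the sensitivity being argued here, a minimal sketch (illustrative figures only, not taken from either speaker's slides) of how the pay-once multiplier \(1/(1-(1-d)^r)\) from the equation above depends on the assumed annual decline \(d\):

```python
# Minimal sketch: how the pay-once multiplier (total cost / initial cost)
# depends on the assumed annual price decline d and replacement cycle r.
# Illustrative assumptions only; these rates are not quoted figures.

def multiplier(annual_decline: float, cycle_years: int) -> float:
    """Sum of the geometric series C0 * (1-d)^(k*r) over all replacement cycles."""
    ratio = (1.0 - annual_decline) ** cycle_years
    if ratio >= 1.0:
        return float("inf")  # prices not falling at all: the series diverges
    return 1.0 / (1.0 - ratio)

for d in (0.05, 0.10, 0.20, 0.30):
    print(f"{d:.0%} annual decline, 4-year cycle: {multiplier(d, 4):.2f}x initial cost")

# 5% -> ~5.39x   10% -> ~2.91x   20% -> ~1.69x   30% -> ~1.32x
# A few flat years at the start therefore raise the required endowment
# substantially, which is the point being pressed in this exchange.
```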
So you get into a feedback cycle which cancels out the advantage of the cost decrease.

Well, let me just comment. The core point of the model isn't specifically what cost X is this year or next year. The core idea is pay once. And I think we can argue about what that X should be. In fact, we need to argue about it, because many of my faculty are saying that if it's anything more than zero, it's too much. That's the real problem: the notion many faculty have that they shouldn't pay for this at all, specifically since these are data that they're making available to the world. But I'm more optimistic than you are, and I think we can have reasonable models that incorporate some amount of flexibility and that track the decrease in disk drive and other costs. I think the fundamental notion, that you should not charge people or grants on a monthly basis for disk or storage costs when you have no guarantee whatsoever that that funding will continue indefinitely, is a good one.

And at Purdue, I'll just say that for the cost model, the vice president and the CIO want to charge a somewhat similar rate. Given that a project may last two years or three years or five years, you only have a set amount that you can spend during that time, and, we haven't quite decided this yet, but for certain projects under a certain size, the university might have to subsidize a lot of the funding for storage. And then, for a long-term solution, the Libraries would figure out the overhead for maintaining the system over the longer term.

Yeah, I'm agreeing that this is an important problem. The reason I'm pushing on it is because you're already setting a price. It's the best we can do. Right, I know. We have to set some price right now. Right, I'm agreeing. And so it's an urgent problem to get a better model for how to set that price. And I think we have a lot more information than appears to be being fed into your model, although what I'm getting from your model is that you've already built in a substantial buffer.

Yeah, there is some amount of buffering built into it. One of the things we were surprised at is how much simply adding tape adds to the cost. And one of the questions is, you know, if you are storing data forever, you were saying that just one disk copy and one tape copy may not be enough, and that is a real concern. The problem is that we can't do this indefinitely. If we start doing two disk copies, two tapes, et cetera, et cetera, we're talking about costs that really will be totally staggering.

Yes. If I could change the subject completely. I feel that you're describing this whole area from the central support, library, computing viewpoint, and I've been looking at it from the totally opposite point of view. I've just been asked to be on a faculty advisory group on this topic at Cornell, and there seems to be very little parallel between what I think the faculty might want and what you're offering. To start with, I have no knowledge of other disciplines, but in the disciplines I know, the number of projects that produce materials worth keeping is very small. And for those that are worth keeping, it's usually short term. The traditional way has been to take whatever you happen to have and burn it on a CD or some larger device, which you put in a drawer and somebody discovers years later.
There are a few very big projects. I was doing the calculation on my own current project, and your $5,000 per gigabyte is a million dollars. Per terabyte. Per terabyte, not per gigabyte, fine. I've only got 200 terabytes; you can do the arithmetic. It's still a lot of money, okay? It's a lot of money, and unfunded, because you're just saying take it off your grant, and the NSF just takes it off your grant. The other thing is the process by which grant proposals are written. I belong to a couple of departments, and I have a spectacularly high funding ratio from the NSF, embarrassingly high. The way proposals are put together, all the effort is put into the scientific program, and everything else, of which this is part, is delegated to support staff. I mean, I literally will not spend more than half an hour on the budget for a million-dollar grant proposal. The universities have come up with ways in which some mid-level person does this sort of work, and you're not going to expect people to put any effort, at the time of writing proposals, into anything at all technical about data. Finally, of course, there is the fact that the projects I've been on where I expected to generate worthwhile data sets didn't, and the ones where I didn't expect to generate useful resources did. Fortunately, the NSF is very understanding: if your research changes during the program, the program officers will normally allow you to change your research, and I think that's the norm rather than the exception. So my basic point is that the sociology of research, which is different in every discipline, has got to drive these things, rather than the view as seen from libraries or computer centers.

And I think one of the curation services that we want to provide is what's referred to as deselection: if something is put up there, how long should it stay up there? But you're absolutely right that there are some faculty who don't have data in the sense that they may have a simulation or a model or something, and there's no data per se, other than that the program itself may be the data. So that's one of the things we want to do: we want to make sure that somebody looks at their project. The ones in particular at Purdue, the multi-million-dollar grants that they want to make sure come in, have the best proposal coordinators assigned to them, and what we hope to do, like I said, this is a research experiment for us, but what we hope to do is be able to get to the rest as well. I think it's a goal.

Just very quickly: I understand that faculty think that writing their data to a CD or DVD is a good data preservation plan, but as an IT person I know that it's a terrible data preservation plan, and part of our job is to make sure that we give them an alternative that is better than that. DSpace, and DataSpace at Princeton, absolutely was designed to be handled by staff and mid-level people, not at all by the faculty. We can't require a lot of data preservation planning before they submit the grant, and part of what both of us are talking about is having online systems that make it very easy to generate these data management plans and put in a number. So yes, absolutely, we think that this has to be very easy and very quick to do. And it is not an unfunded mandate; this is not money that is being taken away from the research. We are also going to drop that $5,000, and we will be dropping it every year. Hopefully we can get it down to a thousand or less.
The number right now is high. We know that, but we are not asking the faculty to pay for it out of their research or out of their pockets.

Yeah. Hi there, Kevin Kidd, Boston College. I will take for granted that we want to preserve data, whatever the particular methodology of any given strategy. I don't think it is obvious that that is going to happen. I am not clear whether what you are proposing is a preservation strategy or a backup strategy; that seems to be one of the biggest things coming out of this. I'm asking: are you talking about preserving metadata? What about updates to the data that you are preserving? What about format issues? What about what we are going to get 50 years from now? I guess I am wondering what you have thought about there.

I will tell you what our current model is. Our current model is a bit preservation and dissemination model. We store the bits and we disseminate them. You can access them through DSpace. There is metadata, so you can do searches, and we are looking at what that metadata should be, but that is going to depend a bit on the specific data set; we are still working out what all the metadata should be for all the various things. As regards preserving formats and so on, that is really, really interesting. I mentioned the fact that the funding model here preserves the original bits. It does not include funding for converting those bits. My personal feeling is that most of those conversions can be done in a different way. You can call us up today and say, you have my data, it is interview data, it is stored in English, I need it in French? No, our model is not going to fund that. Somebody is going to have to pay for that. A lot of the ancillary costs associated with this storage would have to be paid by the person who is requesting the data if it requires some special handling. As for issues of bit decay and stuff like that: we do preserve against those, and remember, we are copying the data every three years, or every X years, out to new drives, so the bits are getting refreshed. I think it was Cliff who said earlier that we are talking about a 10-to-20-year problem, where usually things will get forward versioned or the software is backward compatible.

One of the things that we do with the profile is ask how long the data has to be preserved or archived. Some people say forever, but some say, after 10 years I doubt this will matter, and it may differ by discipline. We haven't seen enough to be able to say. But I think some people don't expect their data to be there forever.

I agree in premise with everything you guys are saying. I think there are concerns about, what about linking data to something else that's involved here? So I guess... Yeah, absolutely. These models only account for part of it. I didn't talk about all the governance issues associated with this. For example, intellectual property: what happens when a faculty member dies? To whom does the data belong? Who manages it? Who can assign rights to it? All of those things need to be spelled out. Thank you.

Will they be able to hear you? Please use the microphone. What I particularly liked about the Princeton model, and it was very much like the FCLA model, where it was a dark archive and they were just storing bits, is that essentially you've optimized your cost because you push curation back on the disciplines, and in fact they are the people in the world who know how to curate their data the best.
And they could do it in conjunction with libraries, but the essential cost shouldn't necessarily be borne by an IT center, or by a library for that matter. So the notion of a DSpace repository, which I happen to like, and how you're implementing it, I think is a nice model. The one thing that I did want to ask you, and I'm asking this a bit out of ignorance and a bit tongue in cheek: what are you buying, right there? I've managed storage for maybe longer than I want to admit, and I can actually remember buying one gigabyte of storage from IBM for $100,000, that's starting to date myself. But there were always these issues of storage management, and I was wondering how that figures into your equation, along with the software that goes with it.

Yeah, so our model is extremely minimalist on these issues. In other words, as you said, we store the bits, and the storage management, all those kinds of issues, are kept to a minimum. We try to keep staff costs very, very low, curation costs very low. We've just got disk drives, we've got the data there, you can access it. For example, one of the things we're insisting on is that you make the data public. That's part of our model, because otherwise you get into access rights, and access rights involve IDs and authentication. How am I going to authenticate someone 30 years from now? That's just not going to work for this kind of model. You really have to be able to say, no, the data are there, they're publicly accessible. Someone asked about replacing data. We don't allow that. You've paid for that data set to be stored forever. If the data set is superseded, we'll be happy to put a note in the metadata saying these data have been made obsolete, or whatever, if you'd like to do that. If your data are violating copyright or violating the law, then we'll definitely make them unavailable, in the sense that you can't get to the data specifically through the repository, though you may be able to get to the metadata. But we won't give you your money back. You've paid for the storage of that data set, and that's what we're delivering.

It's toward the end of the session. Any more questions? One more? No, you've already had enough. The question of the rights in data is very complicated. Do you have a form to dedicate this stuff to the public domain? There's a license, essentially, that you agree to at the time, and it's fundamentally this notion that it's going to be publicly accessible. We do also ask the departments to designate a staff member who's going to curate the collection, in the sense of simply letting us know whose data can be stored there, and hopefully that responsibility can move as the department progresses and changes over time. It's very, very important that this be publicly accessible, or else you get into just intractable issues. What happens when the researcher leaves the university? In our case, nothing happens. The data are publicly available. They've been paid for. That's the other beautiful thing: you don't have to worry about the person leaving. They paid for the data. It's there. It'll always be there.

I think we should wrap up. Thank you very, very much. We'll be here if you have other questions.