The Digital Preservation Network: remarks by James Hilton and Brenda Johnson at the 160th ARL Membership Meeting, convened by Pat Steele.

Let me welcome you this morning. I'm Pat Steele, Dean of Libraries at the University of Maryland, and I have the pleasure of recognizing your good judgment in coming to this conference and to this particular meeting, because I think you're going to get some very good information about the background, the goals, and the conversation that's been going on for a number of months about the creation of the Digital Preservation Trust Network. I'm getting my names mixed up, see; we're trying to make differentiations among all of these things that are going on right now. The Digital Preservation Network, DPN, is coming at us fast and furiously, and today the speakers that we have are James Hilton, who is the Vice President and Chief Information Officer at the University of Virginia, and Brenda Johnson, who is the Ruth Lilly Dean of Libraries at Indiana University. Both of them have been involved in pushing and shaping this initiative from the very beginning, and there's a lot to share.

I want to start by framing a little bit. The part of the talk that I'm going to do is essentially the presentation that was done for the AAU presidents two weeks ago. The other caveat is that I literally got back from two weeks in France the day before yesterday. I'm not really sure what time zone I'm in, so I'm going to try not to be incoherent, but I can't promise it. So this is a presentation that was pitched to the AAU presidents, and as such, there are a couple of caveats that I need to make. One is that it is a pitch aimed at enrollment and a call to action, which means, among other things, that it's not high on subtlety, right? I know that all of these issues are incredibly complicated. I know that, particularly in the library community, lots of work has been done on them, and the fact that all that work doesn't show up in the pitch is because it's an enrollment and action talk. I used to teach intro psych, and one of the things about teaching intro psych is that you dabble in each of the different areas. I would tell the graduate students from the different disciplines who were my TAs (I had 1,200 students, so I had 30 TAs), when I get to your part of psychology, you're going to retch because it's oversimplified. For what it's worth, when I get to my area of psychology, I want to retch because it's oversimplified. So I ask your indulgence.

The other thing I want to talk about, by way of background, is a little of the history of DPN. I would characterize my role in DPN these days as chief evangelist, but DPN is in fact a growing collaboration, and DPN has many architects. DPN started really as a conversation between Karin Wittenborg and me. After I'd been at Virginia for about a year, our president was leading a group that asked, well, what about digital preservation? We started with a small group of folks, we had some conversations, and that didn't lead to much of anything, but it planted the seeds that became DPN. Basically, Karin and I have been expanding the conversation since then. So it is a highly collaborative work, a work where the collaboration continues to grow. Rick Luce has been involved in it. Susan Nutter has been involved in it. Mike Keller has been involved with it.
Paul Courant has been involved with it. From the beginning, a guy named John Evans, a co-founder of C-SPAN, has been involved, and he has provided a lot of tactical, moral, and every other kind of support. Today there are more than 50 institutions involved, and I would say the architects are those 50 institutions. So it was designed from the beginning to be highly collaborative.

Now, we're going to try to move through this quickly. I'm going to try to go through my part really quickly, Brenda's going to try to go through her part really quickly, and we're going to try to leave time for questions at the end. But if you just can't stand it and there's a question you've got to ask, go ahead and ask. My favorite teaching evaluation I ever received, in answer to "What could he do to improve the class?", said, "Breathe occasionally." So I don't necessarily give the indication that it's okay to interrupt, but it really is okay to interrupt if you want. And I'm really eager to have all kinds of questions, including questions of skepticism, doubt, and all those other things. This is incredibly important. It's too important to screw up because we didn't want to ask hard questions.

So, the problem today. You are an expert audience, so I'll go through this part really quickly, but remember, this is a pitch that was made for the presidents. The pitch, by the way, was made by a combination of Michael McRobbie, me, and Ann Wolford. McRobbie did not use slides; he talked about why the presidents should care about this. The part that you're going to see today is mostly my slides. Ann did a brilliant job of describing why now is the time, leveraging the progress that's been made in libraries, and the importance of collaboration with CIOs. Brenda's going to talk about that same kind of thing with a little different slant. So here we go.

The problem is that the scholarship being produced today is at risk of being lost forever to future generations. Forever, to future generations. And we're not talking boutique scholarship. We're not talking unusual stuff. We're talking everything. It's true for traditional content. Why? Because, among other things, digital books and journals require different, more active strategies to maintain. One of the great things about books: printed on high-quality paper, stuck in a controlled environment, with a couple of copies in different places to keep them safe, they can pretty much be ignored for 500 years with a lot of confidence. Ignore bits for five years and you risk great peril. So bits require much more active strategies to maintain. It's true for scholarship that's emerging in forms where we don't even understand what the product is. Around text, we at least understand from a scholarly perspective what it is; we've had 400 years to understand what the book is. And yes, digital books are a little bit different, and that's a bit of a twist, but it doesn't break your mind to think that way. But there are new forms of scholarship emerging that we don't even understand yet. For me, scholarship is about public communication; it is a dialogue that morphs across different content areas. How should we think about a project like NINES, which is essentially the online home for Victorian literary criticism? How should we think about Twitter?
If scholarship is an online public conversation, does that mean Twitter is now part of the scholarly record? Well, I don't know. It's pretty new, it hasn't been around very long, but we're going to have to find ways to understand that.

And it is true for data, with a vengeance. The only two things I would stipulate that we know about data are that the volumes are increasing and that the demands for access and preservation are increasing. I read an estimate yesterday that said the volume of data is expected to increase by 50% every year, which means it more than doubles every two years (1.5 squared is 2.25). The volume of recorded data doubles every two years. So we know it's increasing, and we know the other thing that's increasing is the requirements, the demands for access and preservation. I don't know if you've looked at data management plans recently. They're kind of a Rorschach test right now. They won't be a Rorschach five and ten years from now; they're going to get more and more explicit about providing access to the data not only now but into the future. And just as an example of what I mean by data: the LHC at CERN produces 10 to 15 petabytes of data annually. It only runs six months a year, thank heavens, but that's 13 petabytes on average per year. Much of the data that come out of these experiments will be thrown away. In fact, one of the big data management strategies is discerning what you want to keep and what you don't. Most of the data being produced is machine data, intended for machines before it gets to humans, so much of it is going to be thrown away. But some of it you sure don't want to throw away. The most recent supernova in our galaxy happened in 1604; it was viewed and described by Kepler, and it was 13,000 light-years away. The next nearby supernova happened in 1987, almost 400 years later, a mere 163,000 light-years away. Those data we want to keep forever; they are irreplaceable. So how do we do this? How are we going to discern? How are we going to find strategies? All very hard.

Not only is the scholarship at risk of being lost, but only universities can solve this problem. Only universities can solve this problem. This was McRobbie's main theme when he talked to the AAU presidents. Google's not going to solve it. Google may offer to solve it; lots of companies are going to offer to solve it. But Google's not going to solve it, and Elsevier's not going to solve it. The reasons are multiple. One is that companies just don't have the life expectancy that universities have. One of the points Michael made to the presidents: John Chambers, who's an alum of Indiana, said that when Cisco was founded, there were ten networking companies vying. Today, three of those companies are left. Universities last for centuries. Our universities are young by European standards, but we are built for centuries. The other thing is that we're the only ones who have the mission to preserve this for the sake of preservation. It is in our self-interest; we are the scholars who produce it and mine it. We're also charged by society with it. One of my favorite comments from Paul Courant: we are perfectly comfortable taking on businesses where we may not actually see where the profit line is. In fact, he once described universities this way: we are a collection of businesses that each lose money individually, and we make it up on volume. Which is kind of sort of true, right? If you look at your research expenditures, they don't actually pay for themselves. The in-state student doesn't pay for himself. We count on philanthropy and a few other things to smooth the edges out.
But the fact that there's not a quarterly profit in it doesn't surprise us. That's nothing new. (One screen went out, but one's still going. Oh, now we're both good.) Okay. So, only universities can do it.

What's the current state of play? Well, the current state of play, as you know better than anybody, is that there's lots of activity. There are lots of repositories. There was a period, in fact, in which we all (please pay attention to that screen over there, or that one) threw up repositories on an almost weekly basis. Everybody threw up their own repositories. Over time, one of the things we started to see is aggregation, because in fact, for most repositories, going it alone institution by institution doesn't make a lot of sense. There are great economies of scale in this space. HathiTrust would be my poster child for aggregation. HathiTrust was born because suddenly a bunch of institutions were going to catch a bunch of digital objects coming out of the Google book-scanning project, and rather than building repositories for them all separately, we went ahead and built HathiTrust together. We put it all there; we count on it. Aggregation. You see the same kinds of things in other areas; you start to see aggregation around other types of content, for example Shoah. So: lots of digital collections, with a smattering, a growing amount, of aggregation, driven largely by economics, playing out across different content areas and object types, with most of the emphasis on current access and little more than promissory nods toward long-term preservation. That's less true in the library community specifically, but a lot of this aggregation is happening outside the library. If you are in big data, you are often talking to the disciplines, not to the librarians. We'll give some examples of that in a minute.

And most importantly, all of these aggregation efforts are susceptible to multiple single points of failure. Multiple single points of failure. What do I mean by that? Well, you can have technical failure. In the late 1800s, the Rotunda, the library at the University of Virginia, burned to the ground. These technical failures are not a new problem. You can have political failure. The library at Alexandria has been reborn with a heavy emphasis on digital. Whether it ultimately survives the outcome of the Arab Spring is still unclear, because today that library is seen by some as a sign of political corruption, and its contents challenge worldviews. So it's not clear whether it will survive. And then there's the failure of will, funding, and effort, the most common failure we face. The Sloan Digital Sky Survey is my poster child for this. The Sloan Digital Sky Survey has mapped more than 930,000 galaxies. It's taken eight years to collect the data. Two thousand articles have been written based on the data in the Sloan Digital Sky Survey, and those articles have drawn some 70,000 citations. It's an amazing resource, and it's data we want to keep forever. And there are efforts to preserve it: Johns Hopkins and the University of Chicago both have preservation efforts going for the Sloan Digital Sky Survey. Here's the problem. Those preservation efforts are funded through 2013. And almost certainly Sloan will come forward and fund it for another five years.
But what happens if suddenly Sloan decides cancer is the real thing, not the sky? How do we organize around that? More importantly, how do we push the preservation envelope out beyond 2013 to, say, oh, I don't know, 3013? Because these data we don't ever want to lose. And this is the most common form of failure: a failure to find a sustainable, scalable solution.

At this point, we reminded the presidents that we've seen this picture before. It looks incredibly like the networking world looked pre-Internet2, before the academy built its own high-speed research backbone to fuel discovery. In fact, the tagline we used with the presidents was: in the same way that Internet2 fuels discovery, DPN needs to catch and preserve those discoveries. They're flip sides of the same coin. So the sense of déjà vu: the landscape today for digital preservation has lots of one-off solutions, emerging aggregation, and multiple single points of failure, and the trajectory looks very much like the networking world's did. There are many layers to the problem. Networking is not just plugging things in: there are hardware layers, addressing layers, software layers, identity layers, and service layers on top of it. It is a many-layered, complicated problem, and it's constantly moving. Huge cost advantages accrue to scaled solutions; the one great lesson we've learned from the network era is that scale matters, tremendously. While we can buy commercial services, it's not clear what we're getting. Anybody can slap a preservation logo on something, but whether you're in a format that will last, or whether the vendor will last, is not clear. And we risk recreating the lock-in problem that we've seen around publishing, in the data space especially. Waiting only makes the problem harder and more expensive to solve; it's not as if waiting is going to make it any easier, or that there's a solution waiting in the wings. So we're either going to solve this problem institution by institution, at great expense and with little chance of the solutions lasting, or we'll solve it together at scale, just like we did for high-performance networks. That was the essence of the pitch to the AAU presidents. But of course there's more.

Just a quick thing about single points of failure. How do you deal with them? One of the things DPN has done is take a lesson from the design of the information systems of the space shuttle. When NASA built the shuttle's automated information systems, it sent out a set of specs and awarded contracts to three different vendors. Those vendors had to design to the same common specs. They could not share hardware, they could not share software, they could not share organizations; they had to be different companies. And when the shuttle flew and made decisions, at least two out of three of those systems had to agree. The lesson DPN takes from that is that the key to long-term sustainability on the technical front is replicated diversity. Not just replication across geographic locations, which we all do, but replication across the very things that you think might cause single points of failure. I'll talk about that in just a second.
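To make the two-out-of-three idea concrete, here is a minimal sketch in Python. Everything in it (the function name, the checksum strings) is illustrative, not anything DPN has actually built:

```python
from collections import Counter

def trusted_value(answers):
    """Majority vote across independently implemented systems.

    `answers` holds one result per implementation; the shuttle flew three,
    built by separate vendors to a common spec. A result is trusted only if
    at least two of the three agree; otherwise the disagreement is surfaced
    rather than silently resolved.
    """
    value, count = Counter(answers).most_common(1)[0]
    if count >= 2:
        return value
    raise RuntimeError(f"no two systems agree: {answers}")

# Example: three diverse systems independently re-derive a stored object's checksum.
print(trusted_value(["sha256:9f2c", "sha256:9f2c", "sha256:0a41"]))  # -> sha256:9f2c
```

The voting itself is trivial; the precondition is the point. The three answers must come from systems that share no hardware, software, or organization, so that a single bug or failure cannot produce the same wrong answer three times.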
So the goal is essentially to create what I would call an archive backbone, eliminating single points of failure by building in replicated diversity from the beginning, and to create a sustainable framework at scale that evolves to adapt to new preservation opportunities and challenges. With me? Okay.

So the proposal: create a federated digital preservation network, DPN, owned by and for the academy, just like high-performance networking. Ensure that the objects and metadata of research and scholarship are replicated and preserved across, and here's the diversity: you want diversity across software architectures; you want diversity across organizations; you want diversity across geographic regions (this one most of us cover already; HathiTrust, for example, has replication across geographic regions, but not across software architectures and not across organizations); and you want replication, ultimately, across political environments. An important point: the conversations about DPN have been U.S.-centric at the beginning. That is because we started with text, and we started with copyright, and we went, oh my God, let's just put that on hold right now, deal with what we can deal with in the United States where we understand the law, and then worry about the rest of the world where the laws change. The ambition was always to move beyond being U.S.-centric, and the minute you get to data, you are in a global environment already. So although the list of DPN members right now is U.S.-centric, that's merely an indication of where we are in the startup process, not where it needs to go. Ensure sustainability and longevity by building a framework that can scale and evolve over time, formats, and organizations: DPN is about building a complex adaptive system, not about building a piece of software. And rigorously and continuously audit and verify succession capabilities; we can talk more about what we do there if we need to.

So it gets really fast now. Conceptually, what does the Digital Preservation Network look like? You basically have two kinds of things. You have these nodes out here, and the light blue is supposed to be a plane that cuts across, with the nodes below it; we're still working on graphics. Basically, the nodes are what exists right now, where you're starting to get aggregation. The idea is that we would turn some of those into contributing nodes. Contributing nodes want to deposit their stuff into DPN, which runs replicating nodes. Contributing nodes worry about access. They worry about their content, about what they're collecting. They keep doing all the things they're already doing. DPN tries to leverage the activity that's going on right now, not replace it. So let's look at a couple of examples. With text, it's really easy conceptually, and we're in conversation with folks. Stanford has agreed to run a replicating node. AP Trust, an organization being spun up right now, has committed to run a replicating node on the Fedora architecture, which is different from the architecture that Stanford's SDR uses. And we're in conversations with HathiTrust about whether HathiTrust will run a replicating node. The idea is that contributing nodes bring their stuff in. In the case of HathiTrust, that's the Google Books content; in the case of AP Trust, it's much more likely to be electronic theses and dissertations and other kinds of things, not the Google Books.
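As a sketch of what replicated diversity means operationally, here is a toy selector in Python; the class, the attribute names, and the idea of choosing a triad programmatically are illustrative assumptions, not DPN's actual design:

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class ReplicatingNode:
    name: str           # e.g. "Stanford", "AP Trust", "HathiTrust"
    organization: str   # who operates the node
    software: str       # repository architecture (SDR stack, Fedora, ...)
    region: str         # geographic location

def diverse_triad(nodes):
    """Return three nodes that share no organization, software stack, or
    region, so that no single institutional, technical, or geographic
    failure is common to all copies of a deposited object."""
    for triad in combinations(nodes, 3):
        if all(len({getattr(n, attr) for n in triad}) == 3
               for attr in ("organization", "software", "region")):
            return triad
    raise RuntimeError("no fully diverse triad available")
```

A contributing node deposits once; the federation's job is to see that the copies land on a triad like this and remain verifiable there.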
Again, we're not trying to go out and build another HathiTrust. We want to leverage HathiTrust and provide another level of insurance behind it, not replace it. So with text, we could imagine pretty quickly bringing up a triad. The basic notion in DPN is that you want things replicated with diversity, so you need a triad: different architectures, different organizations, different geographic locations. That meets all the requirements except the political one so far, but we'll work on that; remember, we started with text and went, eh, the law's complicated, we'll start here. And then the logic would be to roll out to other kinds of content areas. With rich media, you could imagine: you've got Shoah, and they've got to be running some kind of architecture, I don't know what it is yet; you've got C-SPAN, whose archives, I know, are at Purdue, and beyond that I don't know anything about it, but they're running an architecture; and some of what AP Trust may take in is also video. So again, you could create a triad based on content type. And for large data, conversations are going on right now with lots of folks about how we engage data. These are purely hypothetical examples, but the high-energy physicists have the ATLAS data management system; in astronomy, you've got the Virtual Observatory network; in the earth sciences, you've got DataONE. These are all, for the most part, focused on access. So can you engage them, the disciplines? You don't have to go take the data back from them. You say: can we engage you, the disciplines, to make sure that your stuff gets replicated safely? And you, by the way, stay in complete control of your stuff, including access, unless of course you go out of business and there's some reason it might be lost. And then the academy owns it. So part of succession is about making sure that the rights to decide about lighting content back up pass forward in the event of failure.

DPN is not a software project; it's an ecosystem, a complex adaptive system designed to evolve with new forms of scholarship, changing demands, and the evolution of software and technology. This is an important point to remember: whatever software is built to push DPN version 1.0 out the door won't be what we're running ten years from now, and it will look like it sucks in comparison to what we're running ten years from now, just like the first network Internet2 rolled out looks pathetic in hindsight. It's about evolving and growing. DPN is a federation, not a monopoly. You wouldn't want to build a monopoly if you buy the logic of DPN, because that would in fact be a single point of failure. DPN brings efficiency by leveraging the deep, diverse ecosystem that has already evolved, not by replacing it. The primary functions of the DPN federation are to audit and verify, and to provide grant-based and contract funding to the replicating nodes in a manner that ensures functional independence. Much as Internet2 contracts with Indiana University to run its global NOC, DPN will contract with organizations to run the replicating nodes. It won't replace the operating expenses HathiTrust has for what it's already doing, but the replicating part that comes on top, the audit and verification, all of that, DPN will have to fund.
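The audit-and-verify function just described reduces, at its core, to periodic fixity checking across the replicating nodes. A naive sketch with made-up interfaces (the real thing would work through each node's own repository API):

```python
import hashlib

def audit_object(deposit_sha256, copies_by_node):
    """Re-hash each node's current copy of an object and compare it against
    the checksum recorded at deposit time.

    `copies_by_node` maps node name -> the bytes that node currently holds.
    Returns the nodes whose copies no longer match and need repair.
    """
    return [node for node, blob in copies_by_node.items()
            if hashlib.sha256(blob).hexdigest() != deposit_sha256]
```

Combined with the two-out-of-three logic sketched earlier, a mismatched copy does not trigger an argument about who is right; it simply gets re-seeded from the copies that still agree with the deposit record.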
DPN will also provide a legal framework for holding succession rights. I sometimes compare DPN to the Roach Motel: things check in, they don't check out, short of a court order. And it provides a structural organization for aligning and leveraging preservation investments and activities.

Can we afford DPN? Well, the first question is how much DPN is going to cost. Don't know. But a reasonable estimate might be $15 million a year as a starting place. What's that based on? In scope, the ambition of DPN is equal to the ambition of Internet2 (it's actually a more complicated problem, I would argue, than Internet2), and Internet2 needed about $15 million a year when it first started out. Another way to think about it: you're going to have to get these suckers talking to each other, and each of those is a software project, sort of a Kuali-scale project. Kuali-scale projects typically burn $4 million a year in the startup phase, and you've got three of them, so you're at $12 million right there. So, $15 million as a ballpark at initial startup. Can we afford it? Remember, what you're paying for is funding, verifying, migrating, holding succession rights and, not on the slide, paying for storage, the extra storage that replication requires. At $15 million, the annual cost at startup is five ten-thousandths of a percent of the combined annual research expenditures of the R1 institutions. It is chump change. It is rounding error, particularly given what's at stake. And importantly, if you build DPN, you can start accessing other revenue streams. The feds won't pay for ongoing operations, but the feds will invest one-time capital in new equipment all over the place; then it's on you to fund its ongoing operation and replacement. DPN embraces that. DPN starts with the assumption that only the academy, in the end, is going to care enough about this to fund it for the long haul. So the pitch to the presidents was: we're going to have to pay for this, but if we go ahead and embrace that, we can then leverage these other funding opportunities, instead of doing it the other way around. What we do now is go for the grant funds and then say, oh, we promise we'll work out the long-term sustainability later. DPN tries to hit sustainability from the front end, or at least aggressively embraces the notion that we own the problem.

So, DPN benefits. Preserve scholarship for future generations: that's why you should want it. Why do you need it? To ensure that DPN members have continued access to the scholarship of the academy in the event of one or more failures. That's an important point. Absent that point, what you have is a classic tragedy of the commons, a prisoner's dilemma, a public-good problem. Internet2 was easier in some ways, because you could say to the presidents: man, if your scientists can't get to our networks, you're toast. Preservation is so much a public-good problem that the really smart thing to tell every president to do would be: just sit back and hope somebody else solves it, because it's not going to come to a head on your watch. So this benefit tries to make it a little more tangible.
You want it as insurance; it's about risk management for your institution. Evolve to include new forms of scholarship and data as they emerge. Rationalize our collective investments in preservation efforts and leverage diverse funding sources. Create a framework against which the academy can rework publication workflows for a digital world; that's a totally different talk, but here's the core problem: we, faculty, understand workflows in the analog world. We understand them so well that we're convinced that's the way God intended it to be, or Darwin, depending on preferences. We don't have those workflows for the digital world, and that blows everything up. Peer review in an analog world makes perfect sense, because peer review shows up every time you have to make a large economic investment. When the marginal cost of publishing in the digital world is zero, where does peer review show up? Peer review, by the way, isn't going to go away; it is part of what it means to be an academic. But in the digital world it's no longer buttressed by the same economic drivers that buttressed it in the analog world. And provide a way of planning campus-based cyberinfrastructure that efficiently feeds preservation efforts. For example, at Virginia, what we're thinking about is that central IT provides spinning disk: huge parking lots where faculty completely determine when the car goes in, what color it is, who gets to ride in it, how long it stays in the parking lot, all those other things. At some point they move on, and they want to hand the content over to the library or to one of the disciplines for longer-term curation. And some of that content, at the academy level, we want to make sure is backed up forever. So it provides a way of thinking about winnowing and funneling content. This is the list currently; we'll come back to it. How badly did that go over? And here we are: Brenda.

Perfect, thank you. Pressing the wrong one. All right, so here's my outline. I'm going to talk a little bit about DPN, especially from a library perspective, and why it might be right for your institution; give you some of the key considerations; and then, in an effort to help you think through those questions, use Indiana University as a case study for why you might consider DPN. So, why DPN? What James said. Now, I could have put the Mad Hatter up there had I known, but I didn't know. Obviously, he has just done a great job of defining the problem, the current state of play, and the challenges we face, and therefore why we need DPN. And he's talked about it as something that's by and for the academy, which it absolutely is. But my perspective here is to talk a little bit from an academic library and ARL library perspective. As James has already mentioned, we have done some things pretty well, for example in some of these existing library digital preservation models. We've done a pretty good job of geographic replication. HathiTrust, he mentioned, is replicated at both Michigan and Indiana. Chronopolis keeps three geographically distributed copies of its data, at the San Diego Supercomputer Center, the National Center for Atmospheric Research, and the University of Maryland. Published journal content, although we know there are imperfections in terms of what content and how much content is covered, is in CLOCKSS and Portico.
CLOCKSS has its content stored around the world: China, Japan, Italy, the UK, Canada, and multiple sites in the United States. So we've done that; we understand that. We understand the need for geographic replication. Librarians, ARL libraries, we know, we've read, that about 90% of the data in the world was created in the past two years. We get that. We understand the data curation life cycle, starting with the creation of digital objects and moving through access, use, preservation, storage, reuse, and transformation. And we really do get that there's a shortage of data scientists and curators, so recently we've been addressing some of those problems by helping to create more of a workforce in the data curation area. Some of you might have heard yesterday the description of the CLIR/DLF data curation fellowships recently created with the assistance of the Sloan Foundation and the participating institutions; six of our colleagues are currently recruiting data curation fellows: Purdue, UCLA, Michigan, Indiana, McMaster, and Lehigh. Of course, Jim Mullins and his colleagues talked about the ARL E-Science Institute and the efforts there. And there are many other things, as you see up there: the UK Digital Curation Centre, the good work the Library of Congress has done, and NSF.

But what are some of the things you might consider when you're thinking about whether, and why, you should be involved in DPN? I would say the first question should be: what kinds of born-digital data does your institution create? Here you should think of a whole variety of things: lecture capture, music and arts performances, science and social science data sets, specialty scientific instrument output, sequencers, telescopes, sensor networks. The next question: how much of that data needs long-term digital preservation? Here I would encourage you to think about the uniqueness of the data. Can it be regenerated, and if so, at what cost? Can it be reused for new knowledge creation? Then you need to ask yourselves: what is your local capacity to hold that type of data? Are your purchases measured in terabytes or petabytes? Are you prepared for exabytes and zettabytes? And I understand the next term is yottabytes. If you're anything like me, you thought that was from a Seinfeld skit; you can just picture someone going, hey, Jerry, how much storage do you need? I don't know, yada, yada, yottabytes. But apparently it actually means something: one septillion bytes, or one quadrillion gigabytes. And we really are reaching the point where we can talk about yottabytes. Is your local data utility available across high-speed networks that enable mass data transfer? Is your data repository connected to the new Internet2 100-gigabit network? (A back-of-envelope illustration of why that matters follows below.) Again, if you're anything like me, you probably don't know the answers to these questions, so we hope you are talking to your IT folks. As James has said already, and as he said to me, the places where IT and the library work and collaborate effectively will be the ones that succeed and thrive. And as he also mentioned, we in the library often have different perspectives from our IT colleagues; to them, scale matters every time, as he said to me. So we do need to be working together. The last question you might ask yourselves: if you participate in something like DPN, how could your local data curation and preservation resources be used more effectively? Here I'm just encouraging you to think about whether you could do more (I'm sure you're already doing some of this): capturing arts, humanities, social sciences, and science content closer to the source, or even directly from the experiment, with more of the original provenance; or libraries targeting collection-specific digital content in a federated fashion, like we're doing with some of our shared print repositories such as WEST, the CIC shared print repository, or ASERL.
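On the mass-data-transfer question, a back-of-envelope calculation shows why the network matters; the 50% link-efficiency figure is purely an assumption for the sketch:

```python
# Rough time to move a collection over a research network link.
def transfer_days(petabytes, link_gbps, efficiency=0.5):
    bits = petabytes * 1e15 * 8                     # petabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86400

print(f"{transfer_days(10, 100):.0f} days")  # ~19 days: 10 PB at half of 100 Gb/s
print(f"{transfer_days(10, 10):.0f} days")   # ~185 days on a 10 Gb/s campus link
```

In other words, replicating a media archive on the scale described next is feasible over a 100-gigabit backbone and painful without one.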
So IU had to ask itself very similar questions when we were pondering whether to participate in DPN. Why is it important for IU to be a member of DPN? What kinds of data do we produce, and what does IU want to preserve? What is the local infrastructure for storage at IU? And why does IU need DPN to serve as a long-term preservation utility? I'm going to talk a little bit about some examples, and here I'm not trying to impress you with the kind of data that IU has, but rather to encourage you to think about the comparable kinds of data that I'm sure all of you have at your institutions. I'm trying to capture a variety of the different kinds of data we all have. Big data is exploding at IU. We all have digital library collections; here we have digitized collections and photo collections and journals that we publish and video streaming. In IU's case, we've got about 40 terabytes in our digital library collections. That's the stuff we tend to think of and know about. A couple of years ago, in fact when Pat Steele was still at IU, the campus undertook a large media preservation survey and discovered that there are about three million sound and moving-image recordings, photos, and documents from all over the campus. The athletics department, the Latin American Music Center, the anthropology department, et cetera, all have audio, video, and film that need to be preserved and stored: field recordings, athletic events, musical performances. We estimate that that represents about 10 petabytes of data. Moving to something a little different, along the lines of what James mentioned, telescope data: WIYN stands for Wisconsin, Indiana, Yale, and the National Optical Astronomy Observatory. The telescopes are located in Arizona, but the data are stored in Indiana, and each night they're capturing about 500 gigabytes. So imagine how much data that is. With the University of Illinois, Indiana has launched the HathiTrust Research Center, and we estimate that when it gets going, the derived data coming from the HathiTrust corpus will be about 500 terabytes in the initial 18 months. We're also involved with other institutions in a DataNet grant on sustainability data, another 100 terabytes. Genome analysis: petascale. And then there are things like Muslim Voices. I recently met with faculty from the Center for the Study of Global Change, and they told me about this project. It's an amazing website, blog, a whole variety of things. They capture videos and podcasts and exhibits and discussions; it's meant to be a dialogue to create understanding between Muslims and non-Muslims. I am certain they're not thinking about storage of that data, but it's important, and we need to think about that sort of data as well. So then we asked ourselves what kind of local infrastructure for storage we have at IU, and actually we're very well supported: a research file system that supports 48 terabytes, and a scholarly data archive of 15 petabytes. But it doesn't take a math major to look at the kinds of data above the line, and those are just a few of the kinds of data we have, then look at the capacity of the local infrastructure, and realize that we don't have enough storage to support it. A rough tally makes the gap concrete.
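Here is that tally in Python, using only the figures quoted in this talk (the petascale genome work is left out, so this is a floor, not an estimate):

```python
TB, PB = 1, 1000  # work in terabytes

holdings_tb = {
    "digital library collections":               40 * TB,
    "media preservation survey (A/V and film)":  10 * PB,
    "HathiTrust Research Center (first 18 mo.)": 500 * TB,
    "DataNet sustainability data":               100 * TB,
}
wiyn_tb_per_year = 0.5 * 365  # ~500 GB/night from the WIYN telescopes

total_tb = sum(holdings_tb.values())
print(f"named holdings: ~{total_tb / PB:.1f} PB, growing ~{wiyn_tb_per_year:.0f} TB/yr")
# -> named holdings: ~10.6 PB, growing ~182 TB/yr
# Against a 48 TB research file system and a 15 PB scholarly data archive
# shared by the whole campus, the named collections alone would consume
# roughly 70% of the archive before any growth, any other campus use, or
# the extra copies that preservation-grade replication requires.
```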
So this is just a picture of the Muslim Voices site that I mentioned. Then, lastly, we asked: what does long-term, network-centric data preservation such as DPN enable us to do with our local data preservation resources? Here we're thinking about web crawling; capturing local content like state politics or cultural events; short-term web content like campus events; data capture at the source, with provenance; regional or collection-specific content; and, within our institution, data publishing, including multimedia content for dissertations and articles. Those are just a few of the things you might do with your local resources if you're involved in DPN.

Lastly, for IU, another big deciding point was that our president, Michael McRobbie, is a former vice president for IT and CIO. I think James has already captured for you that in a recent EDUCAUSE Review article, McRobbie reminded us that the IT marketplace is the opposite of long-term stability: its strength is innovation, not stability. And he recalled for us that once-dominant IT companies that no longer exist include the likes of Sun, DEC, Compaq, Commodore, and Atari; the list went on. This industry is not, and should not be, the long-term custodian of the data and knowledge of our universities. You can see from the quote up here that he talks about researchers not always being very good at long-term preservation and curation of data, and then the question is who should be responsible for that. Of course he is suggesting that it needs to be universities. As James already mentioned, this is a problem too large for libraries alone to tackle; it's library-centric but not library-exclusive. We need to have the presidents of our universities on board for many, many reasons, not the least of which is financial contribution, but we also need their support and, hopefully, their understanding of what the issues are. So I'm hopeful that the recent AAU meeting will help to advance their knowledge in that area. And next I'm going to turn it back over to James.

OK, so we have just three more slides. We have momentum right now. The meeting with the AAU presidents went very well. We have a champion in Michael McRobbie. We have a number of presidents who have already indicated that they're interested in working on it. Basically, what we're asking AAU to help us with is governance, and to think beyond just AAU. To be clear, we went to AAU because it's a heavy concentration of research universities; it was a great way to get to a group of presidents. But ACE has already signed on to help support DPN, and we're in conversations with APLU. In the end, it is going to be the research intensives, it's going to be your institutions, that carry this. Lots of institutions won't ever step up; it's not their mission, and they're not positioned for it. So we asked the presidents to help us figure out what the governance structure should be. One of the assumptions behind DPN, which it takes from the governance of Internet2, is that in the end this is a very library-centric problem, much as Internet2 is a CIO-centric problem.
But the stewardship of that problem is a university problem, and the stewardship of that problem must rest with the presidents. That was the end of the pitch we made to the AAU presidents. Hunter Rawlings and John Vaughn are going to be working with us and Michael to identify a small working group of presidents. We're going to look at different structures. Right now, DPN is being sheltered by Internet2; they're handling the logistics, the auditing, the invoicing, all that kind of stuff. It's not clear where the ultimate home for DPN is going to be. It's not part of Internet2 right now; it's in close alliance with Internet2. We're going to look to the presidents to tell us how we should do robust governance. DPN was featured (I was in France, but I'm told) heavily at the Internet2 meeting: three presidents or former presidents specifically talked about the importance of DPN in the plenary session. McRobbie talked about it, Mary Sue Coleman talked about it, Molly Broad talked about it. We have momentum. We have 50 institutions signed up so far; feel free to sign up if you want. We have capital. We have leads identified from institutions. There's going to be a DPN meeting on Friday that's going to focus on moving toward operations, and on this theme of momentum: now is the moment. We have 18 months to show something, or else DPN will just be remembered, in my opinion, as a flash that went by and was forgotten. It's really easy to let that happen. It's really easy to decide that this is a really complicated problem, and we've got to figure out governance, and we've got to figure out the technical side, and we need another study, or maybe five or six studies, to really answer the problem. That, I believe, is the way toward paralysis. DPN needs to move with the notion that we're going to roll and evolve, with stewardship from the presidents and growing engagement from the library community and the disciplines. We've got to keep moving, because once you lose momentum, you don't get it back; it's really hard to get it back. So my pitch is: this is a moment in time for the academy to solve this problem, now and forever. Thank you for listening.

Music was provided by Josh Woodward. For more talks from this meeting, please visit www.arl.org.