Okay. Hi, everybody. Welcome to Building Data Refuge: From Bucket Brigade to Sustainable Action. I'll give you a little overview of what we're going to do here today. I'm going to talk a little bit about the history, and then we're going to talk about the events, how we're going to build sustainable efforts, and the Libraries Plus Network.

My name is Catherine Morse. I'm a government information, law, and political science librarian at the University of Michigan. To give you a little bit of history: for over 200 years, the U.S. federal government has sent its publications out to distributed libraries all across the country, a network organized through the Federal Depository Library Program. It didn't include everything the government produced, mainly publications rather than data sets or records, but it was a start. Libraries accepted those things, created the metadata for them, and made them discoverable. Libraries worked on access and preservation. But now the paradigm has really shifted, and most government information is digital. So we don't have a new problem, but we have a more urgent problem: how are we going to continue to create metadata, provide access, and preserve government information?

The chart I have over here can give you an idea of the scale of this preservation challenge. The really small bar there is all FDLP items, which we estimate at maybe around 3 million things. If you've got a really big collection going way far back, you might have 3 million items, and that's quite a lot. There's still a lot of work to be done in figuring out what those things are, how we make them accessible, and how we preserve them. The other two big bars come from the 2008 and 2012 End of Term web archiving initiative. In 2008, the End of Term program got 164 million URLs, and in 2012, 194 million URLs. And of course there's 2016 data too.

So a question that people often ask is: isn't the government taking care of all this? The sad answer is, unfortunately, no. There isn't one government agency that is in charge of all of this, and there isn't one government agency creating mandates that everyone else has to follow. There are agencies that do a lot of work in this area, but there isn't one agency coordinating everything. GPO, the Government Publishing Office, the Library of Congress, and the National Archives all do work with archiving born-digital information. But agencies that produce content are not compelled to work with GPO, and GPO isn't aware of everything that gets created. And there isn't an agency that keeps track of every URL the government creates or maintains. The GSA maintains a list of parent domains, but there is no list of subdomains. So that creates a problem, and it's not a new problem, but it's a growing problem.

If you remember way back to the great government shutdown of 2013, it was really alarming for people like me who work with researchers who rely on this data, because it happened really fast. It was also a big surprise to us which agencies were going to take down their websites and which were going to leave them up. We didn't know until it happened. Those were a really uncomfortable few weeks. And these are examples of what some of the splash pages looked like for the agencies that took down their information.
So the question I have, as somebody who works directly with researchers, is: how can we help them use this content that we know they want to use? As we heard in the keynote speech that Allison gave, undergraduates really rely on .gov websites, and we know that our researchers and our faculty rely on them too. They're the basis of so much of their research. And it's really difficult to explain to researchers how they can proceed with finding content that was at one time easily findable but is now no longer findable. That's why I feel this issue is so important, and why I wanted to work with my colleague Justin Schell on the data rescue event we had in Ann Arbor. And Elizabeth is going to talk about the data rescue event she had at Georgetown.

Hi, everyone. My name is Elizabeth Foster. I am the public policy and social sciences librarian at Georgetown University, and I'm going to give a brief account of how we pulled together Data Rescue DC with community stakeholders in under 12 days.

So why did we get involved? Georgetown has a strong curricular focus on climate and environmental issues. We're home to the Georgetown Environment Initiative, an interdisciplinary collaborative focusing on environmental issues, and during the 2017 school year we had a cluster of courses that focus on climate change issues. So our faculty are actively researching this and our students are actively studying it. In addition, Georgetown University is a heavy user of federal data. In my capacity as the public policy liaison, I work with several programs that require students to use open data to complete an empirical thesis project. So we really have become accustomed to this data being available. And we felt there was a strong professional imperative to use our skill sets to preserve high-quality information and make it accessible, not only to our university community, but to the world at large. In addition, in DC there was a strong sense that people wanted to do something about this. DC is unique in that we have no congressional representation, so we have really no one we can call to complain about this particular issue. But this gave us an activity that we could do, using activism to accomplish something really, really important.

So we had a large gamut of feelings about this. Overall, I would say we were excited but apprehensive. We were excited to use our skills to accomplish an entirely brand new thing, but at the same time, we didn't expect that we would have to do this. During my short professional career as a librarian, I've become accustomed to this data being available and being able to refer students to it whenever they need to use it, and it occurred to me that this might not be the case in the near future. That led to a strong feeling that this needed to be done correctly the first time, and it needed to be done quickly, due to the rapid pace of change happening in the federal government. However, in my capacity as a liaison, I don't really work on digital projects, and neither do my colleagues who were interested in this issue. So we really needed some guidance from people who had expertise in these areas, and that led us to finding community partners. This came to be because my colleague saw a call for a venue for a data rescue event that was in the nascent planning stages. We answered the call, agreed to serve as the venue, and split up the work.
The work was split up between several NDSR residents and community stakeholders: Georgetown University Library, Johns Hopkins, and even more groups that I cannot fit on this slide. There was overlap between the NDSR residents and Georgetown. Joe Carano is our NDSR fellow, and he served as the liaison between the two groups when we were deciding who was going to do what. In sharing the work, Georgetown handled all the logistics: local arrangements, recruitment, promoting the event, staffing the welcome table, et cetera. Our partners were the technical folks. They handled all the guide training and managed the guides on the day of; they knew the workflow. They also handled event registration. Both groups were responsible for recruiting speakers and talking to the press.

Day one was a day to set the scene. We had a teach-in on humanizing climate data, which I moderated, followed by a training session for the guides who would be leading people through the various participant workflows, and a roundtable discussion on open data and data vulnerability. You can see our speaker lineup on the website. Part of it was live streamed, and we had a really great lineup, even though it was Presidents' Day weekend and a lot of people really did not want to be in the library. These are the various paths that our participants and guides worked on. Guides picked a path relevant to their area of expertise and then led ordinary people through carrying out these activities.

Day two, we focused on actually rescuing the data. We decided to focus on EPA data sets. We had a lot of energy and enthusiasm in our crowd, and I would say approximately 200 to 250 people participated over the course of two days. Our diverse crowd included federal employees, which required us to account for their privacy, so we presented them with name tags: if they did not want to be identified, they wore a blank name tag, so they could come actively preserve data sets that they had intimate knowledge of and may have helped create. In addition, faculty, students, scientists, information professionals, and concerned citizens all turned out to help rescue data. If you're curious to see our handiwork, you can head to the Data Refuge website, filter down to data that was harvested during data rescue events, and pick the Environmental Protection Agency as the organization. This is just a sampling of some of the data sets that have been harvested since February.

Our final stats: we harvested 20 gigabytes of data. Next time we will have ethernet available, so we will not have hang-ups when people are trying to download large data sets over Wi-Fi. We also completed several storytelling profiles; when the slide deck is available, I encourage you to click on that link and find out why individuals decided to come out to Data Rescue DC and why this data matters to them and their communities. Over 200 individuals were trained, and we established an online community where people can communicate about this project and get instructions on how to work on these activities from home or work.

What's next for Georgetown? We plan on continuing what we've started, possibly hosting another data rescue event. We plan to work closely with the Libraries Plus Network, and there's been a strong interest in preserving social sciences data, so we're seeking out other opportunities, such as working closely with DataLumos, which is ICPSR's social sciences government data project.
And with that, I will turn things over to Delphine.

Hi, I'm Delphine Khanna. I'm the head of Digital Library Initiatives at Temple University. Elizabeth gave us a really nice illustration of how one event can unfold, and I'm going to talk a bit more generally about what we have achieved so far with the Data Refuge effort, what we could call Data Refuge phase one: the period going from December to March, just four months. All in all, 34 events have happened so far, just like the one Elizabeth described, and several more are coming up. The events were conceived as grassroots and open to the public, and they were co-organized with other groups, like, for instance, EDGI, the Environmental Data and Governance Initiative. During those events, tens of thousands of URLs were seeded to the Internet Archive. Several terabytes of data were saved to cloud storage, some of which is now accessible through the Data Refuge registry, and the rest of the data is still being processed. To support all this work, we developed workflows and some software infrastructure. Of course, we operated under strong constraints, which I would describe as extraordinary circumstances and a very, very short timeline.

So a question we often get is: but do you follow digital preservation best practices? I would say, well, some elements of best practices. We create valid bags using BagIt; there's a small example below. We've inserted quality control processes at several points in the workflow. We make sure to capture the data set's context and metadata, for instance the HTML homepage where the data set was found, any accompanying data dictionary, et cetera. We have a clear definition of the distinct roles that make up the entire workflow, and the workflow is supported by a web-based application that helps people go from one role to another and makes sure that all steps are done. We wrote a lot of how-to documentation, and we explained key digital preservation concepts, such as the notion of a trusted chain of custody, the use of checksums, the LOCKSS concept (Lots Of Copies Keep Stuff Safe), et cetera. And we addressed basic security concerns.

However, we certainly encountered plenty of challenges. The event participants were all wonderfully engaged and hard at work and just great, but there were different skill levels and different levels of understanding of digital curation and digital preservation practices. To mitigate this, we tried to orient people to roles that were tailored to their skill sets. The organizers of the events and the workflow architects were also amazing people who did great things, but they were coming from different professional subcultures and sometimes had different assumptions. We're talking here about librarians and archivists, data scientists, software developers, people from the open data or civic data communities, et cetera. We were developing the workflow application that I mentioned in real time, that is, as events were already happening, which certainly brought its share of challenges. For the public-facing registry, we used the platform CKAN, and we had to use it pretty much out of the box because we did not have time to do any customization. To follow best practices, digital curation processes really need to be excruciatingly methodical and controlled, and that was not always easy to achieve under the circumstances I described. So is this the ideal workflow that we have right now? The answer is no.
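[Editor's note: to make the bagging step concrete, here is a minimal sketch using the Library of Congress's `bagit` Python library, which implements the BagIt specification mentioned above. The directory name and bag-info values are hypothetical examples, not the actual Data Refuge configuration.]

```python
# pip install bagit  (the Library of Congress BagIt implementation)
import bagit

# Turn a directory of harvested files into a valid bag, in place.
# "rescued-epa-dataset/" and the bag-info values are hypothetical.
bag = bagit.make_bag(
    "rescued-epa-dataset",
    bag_info={
        "Source-Organization": "Data Rescue event (example)",
        "External-Description": "EPA data set harvested 2017-02-18 (example)",
    },
    checksums=["sha256"],  # fixity values written to manifest-sha256.txt
)

# Later, at any point in the chain of custody, reopen the bag and verify
# that every payload file still matches its recorded checksum.
bag = bagit.Bag("rescued-epa-dataset")
if bag.is_valid():
    print("Bag is complete and all checksums match.")
else:
    print("Bag failed validation; fixity may be compromised.")
```

The checksum manifest is what makes the trusted chain of custody checkable: anyone who receives a copy of the bag can re-run validation and confirm nothing was lost or altered in transit.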
Stepping back for a moment: our goal can be loosely described as wanting to replicate federal data onto non-federal servers, which, by the way, would be a good idea in general, regardless of the specific circumstances of 2017, because of the LOCKSS principle: lots of copies keep stuff safe. But if that's our goal, ideally it should happen much more upstream. That is, we should be working directly with the federal agencies, with data.gov, NARA, et cetera. If we could do that, we would obviously be much more effective. We would not be spending time trying to discover where the data is on websites or how to harvest it. We would also be able to harvest all the data and all accompanying metadata in a systematic manner. And we would get information directly from the source about, for instance, update frequencies, which would be really helpful. Laurie and Kim will talk a bit more about how we hope to do more of that in the future.

However, remember that this project was developed under a lot of uncertainty, with a very short timeline. And, you know, more generally, right now we are experiencing strange times. Some of the uncertainties we were facing: Who could we work with? Who could we talk to? We did not want to put federal employees in a difficult position. Which agencies were under a gag order? Which agencies were restricted in their ability to collaborate on such a project? There was also a lot of contradictory information circulating. We heard pretty much the whole spectrum, from "Data Refuge is useless, all the federal data is safe, don't you worry" all the way to hearing about federal employees copying their agency's data onto portable hard drives and taking them home because they were afraid the data would disappear, or federal employees, as Elizabeth mentioned, attending data rescue sessions and expressing deep worries about the data. Some of those circumstances explain some of the workflow and architectural decisions we made.

The central achievement of those data rescue sessions was definitely the awareness-raising aspect. Over 2,000 people participated in the data rescue events. They were wonderfully engaged. They got the opportunity to learn about digital preservation and the data life cycle. They learned about, and spent time thinking about, issues like the intersection of data preservation, societal choices, and politics. They were passionate about preserving data, which was really great to see. As someone who has been doing digital preservation for quite a long time, you're not used to that topic of conversation making people feel excited and passionate. It was great to see. There was very broad media coverage: articles in the press, TV and radio programs. And we felt that raising awareness was as important as actually saving the data, because the advocacy function is really crucial in this endeavor, and also for the longer term: we need citizens and people to be aware of what's happening. Who did we reach? The research community; citizens at large (my neighbors had heard about the project on the radio, or they had read an article); the library and archive community and the data science community, where many conversations happened and people were talking about it; and, hopefully, potential funders for later phases of the project.

In conclusion, Data Refuge phase one was quite an extraordinary experience. Large amounts of data were saved, and large numbers of URLs were seeded to the Internet Archive.
The events had an essential role in raising awareness and developing advocacy. And we hope that the next phase will allow us to take things to the next level and build a truly robust and sustainable workflow and infrastructure, while finding ways to keep citizens meaningfully involved. Thank you.

So I'm Laurie Allen. I'm assistant director for digital scholarship at the libraries at the University of Pennsylvania, and I've been deeply involved in the Data Refuge and data rescue efforts. But before we dig into the weeds of what we've learned about data preservation, data management, and federal information, I want to talk a little bit about some of the efforts beyond ours, the folks we've grown to collaborate with or whose work we know about, just to contextualize this project among a number of projects operating in this area.

Obviously the most well known is the End of Term Harvest. They're the oldest project in the group, with their 12 years of experience, and that project, as you probably know, is devoted to creating copies of federal web pages at moments of presidential transition. That project was a partner in ours, and at every data rescue event, one of the activities was to feed the End of Term Harvest project and the work they're doing to back up websites.

Another project that we are fans of, and that we're collaborating with in a mutually supportive way, is Climate Mirror. That's a really grassroots effort, where folks are basically continuing what was started before we formalized a more methodical approach in Data Refuge. What the Climate Mirror people have done is say that basically anyone out there who has a server and some ability can claim a data set, mirror it, and say, "here, I've mirrored it," and that's that. I think we all hope that we will never need to use their data, because we will want data that's more citable and that has a stronger chain of custody, but the Climate Mirror folks are just out there getting whatever they can.

There's also a project called Azimuth, which is focused on environmental and climate data. I know a little bit less about this one, but I know that it's been ongoing for quite a while at UC Riverside, and that they have been mirroring federal climate and environmental data for their own research and to ensure its long-term stability.

A project I know a little bit more about is the work being done by the California Digital Library, in collaboration with the Dat project and Max Ogden, to create a mirror of data.gov. Data.gov, as you probably know, is basically an inventory, a catalog of many of the data sets that are produced and housed on federal websites. It certainly doesn't have all of them, but it's the best listing that we have. The folks at the Dat project, working alongside the folks at the California Digital Library, have created a mirror and are working on creating access to that mirror of the data.gov database.

There's also, as Elizabeth mentioned, DataLumos, which ICPSR launched: basically a public participatory project to store and house federal social science data. Within the data rescue events, as they've continued, some universities or libraries or citizens have chosen to take on data outside the realm of climate and environment, but the folks at DataLumos are really focused on the social science side of things, and of course they have the 50-year history of ICPSR and all that expertise behind the project.
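[Editor's note: coming back to the End of Term seeding mentioned a moment ago, here is a rough illustration of what nominating URLs to the Internet Archive can look like. The End of Term project uses its own nomination tooling, so this is not the events' actual workflow; it is the simplest way for one person to request a capture of one page, via the Wayback Machine's public Save Page Now endpoint. The URL list is a hypothetical example.]

```python
# Ask the Wayback Machine to capture a handful of pages via its public
# "Save Page Now" endpoint: a GET against web.archive.org/save/<url>
# requests a fresh crawl of that page.
import time
import requests

# Hypothetical list of .gov pages someone wants preserved.
urls = [
    "https://www.epa.gov/climate-indicators",
    "https://www.globalchange.gov/",
]

for url in urls:
    resp = requests.get("https://web.archive.org/save/" + url, timeout=60)
    print(url, "->", resp.status_code)
    time.sleep(5)  # be polite; don't hammer the archive
```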
And then one does need to mention NARA, the National Archives and Records Administration. We would love to see them have the mandate, the regulatory system, and the resources to provide additional backups of this data, but at this point, our understanding is that they don't have the frameworks to make sure that it's all kept safe. I am working on a little lightweight inventory of these projects, along with a couple of others, at that URL; the link will be live in the slides.

So, when we've been talking about federal data: as Delphine said, as we developed the workflow, we were developing it while every weekend five, six, seven events, or three events, or two events were happening. It's very hard to really perfect a workflow when you've got a couple of hundred people who need to use it this weekend. So, to back away from that a little bit and think about what we have learned, one thing is this question of what we mean when we say federal data. Within the library community, we talk about data and we mean something specific, but in the truth of practice, federal data can mean anything from a basic HTML website, which can be comfortably harvested by the End of Term archive, to something much more like a research data set, which comes with data dictionaries and metadata and all of that. Along that spectrum there are web pages with embedded content that can't be picked up, say a map with a data visualization on it; that might be considered data by some communities, and they might really need that data. Then there are directories of files; the federal government has a lot of FTP servers, and those actually can go into the Internet Archive, but they're a little bit of a different kind of data. And then there's the query interface. These are all the intermediate steps. The query interface is one of the trickiest problems, and it's one of the ones that the Data Refuge community and the data rescue events have spent the most time on. That's the notion that there's data underneath that query interface that we need to pull out, wrap up, and store. So, to be straightforward: at the events, we try to send basic HTML pages and directories of files to the Wayback Machine, and query interfaces and research data sets to Data Refuge. That embedded content is not solved, but we do the best we can.

Taking that query interface question a little bit further: how do we safeguard the data in a query interface? The ideal here is that you pull the data out of the database through the query interface until you have data files. You combine those data files with metadata and HTML pages, and you create something like a research data set; a rough sketch of this step appears below. However, we know from our data management practices in libraries and elsewhere that that's not the ideal. The ideal is not to pull data out of an interface and then describe it, right? The ideal is to describe the data before it goes into the interface, to combine it with metadata and think of it as a data set at that point. And so when Delphine was talking about moving upstream, this is where we've been having conversations with the folks at data.gov and in agencies, trying to understand the way that they approach their data.
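[Editor's note: here is a hedged sketch of the "pull it out through the query interface" step described above, for the common case where the interface is a paginated JSON API. It pages through the API, saves the raw responses as data files, and wraps them with a small descriptive record so the result resembles a research data set rather than a loose pile of downloads. The endpoint, the "page" and "results" conventions, and the field names are all invented for illustration; a real harvest has to be adapted to each interface.]

```python
# Harvest a hypothetical paginated query interface into data files plus
# a metadata sidecar, so the result can later be bagged and deposited.
import json
from pathlib import Path

import requests

ENDPOINT = "https://data.example.gov/api/measurements"  # hypothetical
out = Path("harvest")
out.mkdir(exist_ok=True)

page, files = 1, []
while True:
    resp = requests.get(ENDPOINT, params={"page": page}, timeout=60)
    resp.raise_for_status()
    payload = resp.json()
    if not payload.get("results"):  # hypothetical pagination convention
        break  # an empty page means we've pulled everything out
    path = out / f"measurements-page-{page:04d}.json"
    path.write_text(json.dumps(payload["results"], indent=2))
    files.append(path.name)
    page += 1

# Minimal descriptive record kept alongside the data files, loosely in
# the spirit of the data.json records that data.gov harvests.
metadata = {
    "title": "Example measurements pulled from a query interface",
    "source_url": ENDPOINT,
    "retrieved": "2017-02-18",  # example date
    "files": files,
}
(out / "metadata.json").write_text(json.dumps(metadata, indent=2))
```

The metadata sidecar is the "describe it" step done after the fact; as the talk argues, the better arrangement is for that description to exist upstream, before the data ever goes into the interface.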
And it's been an incredible, really profound learning experience to get to know the way that the open data community sees the data lifecycle, and how different that is from the way that we have seen it in libraries, at least the way I have seen it. So here, on the slide, that's the database at the bottom. It feeds a query interface, and the query interface is used by a scientist to get what they need. That scientist maybe does something else and produces some data, and that data, in some cases, gets fed back into the database. But this picture, which is the way we think in the library community, where the interface is just one output, has really not been the way the federal government has been producing data in many of the agencies. So once again, as we think about moving forward, our work, I hope, will be in helping to move information along this spectrum from web page to research data set, and to rethink our workflow so that instead of the workflow that Elizabeth described, the one all these events have been using, where describing happens almost at the very end, what we'd really like to see is that researching and describing and sharing all happen much more together. We're looking for ways that libraries can work with data producers, either at our own institutions, whose data is often going to feed these federal websites, or at the federal agencies where the data is being produced, so that the data can be more easily saved and preserved. And so, looking towards the future of how we're thinking this might happen, I'll turn it over to Kim.

Hi, my name is Kim Eke. I'm the director of teaching, research, and learning services at Penn Libraries. So we've heard about our analog past and our digital present, about a lot of different discrete events and some larger efforts that are happening. And I'm here to talk about the Libraries Plus Network, which is an effort, conceptually at least, to really integrate this work into our institutions, into the jobs that we do already. What we want to do is move from these grassroots efforts to really sustainable action in the long term. And as we think about federal data, we're really talking about providing research-quality copies of federal data across a trusted network of institutions.

Our good colleagues James A. Jacobs and James R. Jacobs have proposed some goals toward this effort. I highly encourage you to read them; they're on the Libraries Network blog, and they invite your feedback and comments as well. They talk about creating a digital government information library infrastructure that is grounded in international standards for long-term preservation and allows libraries to collect and make accessible federal data for designated communities. The idea is that if each of us picks a couple of communities, over time we will have everything covered. That is the logic. They also talk about, at the front end, just as Laurie was mentioning, having federal information perhaps use information management plans, very much like data management plans for researchers, so that it is documented and specified what the data is, how it's described, and what the plans are for long-term preservation and access. We have been working on this for approximately four months, I think, and what has been wonderfully overwhelming and encouraging is the outpouring of interest.
And as these events popped up, more and more people would contact us and say: how do we help? How do we get involved? So we reached out immediately to ARL, because a lot of people were contacting us, and ARL has been helping us coordinate the efforts. Over time, as more and more people get involved and put their minds to this, we are engaging with people far beyond the library community: people at data.gov, journalists, lawyers. All of this led us to move from the Libraries Network, as we originally conceived it, to something we're calling Libraries Plus. And we're calling it Libraries Plus because there really are a lot of diverse viewpoints and people who have a vested interest in the success and long-term sustainability of what we're talking about, and who also have so much talent to dedicate toward it. So we believe the future is Libraries Plus.

The things that we've done so far, since about February, I think: we've created a blog; we invite you to subscribe and to author posts. We've tried to create a U.S. agency coordination spreadsheet, because there are so many things happening all over the place. I will confess that that's been less successful than we had hoped, but we have learned in the process. We tried to give ideas of how libraries are contributing and what paths other libraries might take based on the resources they have. Laurie talked about experience. We're making videos; we have webcasts. And the "we" that I'm talking about is not just the "we" on this panel; it's really across the nation. So many people are involved, and it's been very encouraging.

So we've learned a lot, and we've eaten a lot of jelly beans. The way we have been operating is that we put a stake in the ground on a calendar date and say: by that date, we're going to do X. And in between, we figure it out along the way. One of the things we did a couple of months ago was pick a date in May to convene these diverse voices in a face-to-face meeting. If it were not for the limitations of physical reality, bricks and mortar, we would invite everyone to participate, but we do have to limit it. The meeting is described, and the participating organizations are listed, at that URL. The intent there is, I think, to suspend the reality of what you know and believe is absolutely correct, listen to a lot of alternative viewpoints, and collectively define the problem we're trying to address, because there are a lot of different ideas out there. To that end, we are having an open webinar slash open meeting on April 19th. The point of that meeting is to inform anyone who cares about what's been happening and what we intend to do at the May meeting, but more importantly, I think, to get feedback from our broader community.

A lot of us believe that we are in a very special moment, and it is up to us to take a good crisis, right, and put it to use. So I think it's also important to recognize that this moment does have risks. The biggest risk that we've seen is that there's a lot of ambiguity; there's no playbook. Being able to embrace that ambiguity, I think, is a skill that we all need to develop further. One of the risks we have in bringing really diverse communities together is talking and not necessarily hearing. And so feeling passionate about our view is important.
But we know that if we get a group of passionate people together without having collectively agreed on what problem we're solving, that's another risk. There's a lot out there; we want to build on what exists, but it's possible that there's something entirely new out there too. There's the risk of the familiar: just staying within our own little domain, talking to our friends, doing our same jobs, not changing what we're doing, being in an echo chamber, and nobody wants that. And of course the biggest crime of all would be failing to act at all, because this is the time.

So I will finish up by saying that this is not some loopy platitude, right? Developing shared understanding is a lot of work. It's going to take time, and it's going to take a lot of conversation. I don't think any of us believes that after the May 10th meeting the clouds are going to break and the sun is going to shine through. We're going to have to keep working, but I do think it's a first step. And we know we can do this, because we already have. Look at all the activities that have happened around this country. The fact is, nobody knew what they were doing, and so much has been pulled off. You heard about all the data, right? So you figure it out, and you keep figuring it out, and we're getting all kinds of people working together.

I will end with this quote from Einstein and Infeld. Even though I've heard people say they're sick of talking about what the problems are, I think even our keynote talked about how important it is to frame the question. What is the question we are asking? What is the problem we are addressing? And are we speaking a common language in how we're going to address it? I think it's really exciting and interesting to get a group together to be curious and look into that. And last but not least: no matter where you come from, or how long you've worked in this space or outside it, every single person has something to contribute. The problem is, you know, big. So bring it.

And with that, I think we are done. We welcome any questions that you might have, and we thank you in advance for what you will do toward this project in the future.