People want to think that the things we've worked on are really important, right? And certainly they are; sometimes it's just a question of scale and whether they become important and significant. That can happen for lots of reasons: maybe you just continue to grow something over time, you have a favorite audience, you have more content on your website, or maybe there's a presidential election. All kinds of things can bring renewed attention and focus. The Freedom of Information Act archive has the potential to be something even more interesting than we thought it was several months ago, and we're going to be able to share a bit of that with you today. As with many of you, we think this is something very, very special, but then again, doesn't everyone working on a digital initiative think it's something unique and very, very special? Of course they all are. But they also have a ton of things in common, and you probably know that too. So in this case, the Freedom of Information Act archive, we think, is a terrific example of a faculty-led project that has a lot of potential to grow in a lot of different ways, maybe even beyond the scholarly community. But it's reached a point of transition. It's been out there since its launch this last summer, and now it's going from a project to something we hope is much, much bigger, and it's going to face some challenges. That's what we're going to talk about with you today. So in this session, I will step away from the mic so you can see the presenters. We have folks from Columbia University, senior librarians, and also the faculty member who created this. Tonight, in the role of the young, enthusiastic, idealistic librarian who wants to see a digital initiative succeed for the good of civilization and the digital humanities, we have Barbara Rockenbach, the associate university librarian for collections and services.
Tonight, playing the role of her somewhat cynical and jaded technology colleague, who has seen lots of projects like this, we have Rob Cartolano, vice president for digital programs and technology services. And finally, rounding out our cast is our leading man, the hero of this quest for sustainability. We're going to travel on this path with him: Matt Connelly, professor of history at Columbia University. We hope to hear a lot from you after this presentation. We really are doing this as a way to share it with you and then to hear back, to get some feedback on what these next steps will be. So without further ado.

Great. Thank you, Nancy. And thank you all for joining us for this last session. As Nancy said, I'm Barbara, and I'm the interim associate university librarian for collections and services at Columbia. In about 2012, Professor Matt Connelly approached me. I was new in the organization at Columbia, as the director of the humanities and history libraries, with a mandate to grow digital humanities. Professor Connelly approached my team, the humanities team, about a data set that he needed for the first-ever digital humanities course at Columbia, called Hacking the Archive. It was co-taught with a professor in English, and they needed a data set from the U.S. National Archives to serve as a pedagogical tool for this class, in which they built something they called the declassification engine, which has now become the Freedom of Information Archive, or at least part of it. You'll hear more about that in a moment. So when we heard from Professor Connelly, we thought, this is fantastic. We want to buy a collection that will be used in a course. It's a new form of engagement for our liaison librarians, working with a faculty member in partnership to support a course, and really pushing forward this new and emerging field of digital humanities.
We were also really interested in it because it tied into the building of our Studio at Butler program. Our digital humanities librarian attended every single session of the Hacking the Archive course. It was held in what was then the Digital Humanities Center, but it also grew into some programming in our Studio at Butler. So it really did advance our program. Additionally, we were able to see this project in alignment with university goals. Obviously, we're all looking to align with the largest strategic aims of the universities we're working in, and this particular project appeals very much to the administration at Columbia. It's a university that cares a lot about First Amendment rights; you can see that here. We also have a provost who cares deeply about the Freedom of Information Act. So this particular engagement with a faculty member enabled us to align very closely with the university. And finally, as a library, we've always been committed to government information and to government documents. We're a federal depository library, and we have always had a librarian dedicated to this work in our government information library. So it seemed natural that we began to think about what government information is in the 21st century. Something of Professor Connelly's that really resonated with us was his call to historians in training: you're not just using paper archives anymore. You are to a certain degree, but there are new types of electronic archives being brought to bear on the kind of work that historians are doing. We were very interested in all of that work. So why wouldn't we get involved in a project like this?

Like many of you who are dealing with sustainability challenges in technology infrastructure, preservation storage, and data management, I've been dealing with dozens of legacy projects that over time have led to data silos, security issues, compatibility issues, performance issues, and political challenges.
While many of these projects are created internally, something I call a self-inflicted wound, some of them are faculty-driven projects that for one reason or another eventually land in the libraries. Many of these projects share common characteristics. First, they lack clear ownership, governance, and funding. They might have had funding initially; they might have been grant-driven; they might have had a strong owner in the faculty member. But eventually they wind up becoming the responsibility of somebody other than the original owner. Typically that is the library organization, sometimes the IT organization, and in my case, the library IT organization. Second, due to their age or implementation method, they typically have some level of technical issues, and the IT organization in charge may need to divert scarce staffing or other resources to keep these systems running, to keep them from losing data or becoming a security problem. Sound familiar? Third, there is typically some external pressure to keep these systems running without sufficient financial support, and as a result this burden also falls upon the library and the library IT organization. These projects contribute to the technical debt of our organization. They affect our ability to complete existing projects and to support the strategic efforts we have set forth to use technology to reshape our libraries to better serve our faculty, our students, and our broader community. As a result, I've become a bit more sensitive about making upfront commitments that have long-term consequences for the libraries. Let's just take a look at one example. Digital Dante was a project created by a student 22 years ago, working with an Italian professor. It was a fabulous project for its time. And over time, it basically lay fallow.
It was patched together, with maybe a little bit of electrical tape and Elmer's glue, by our teaching and learning technology folks to keep it running. And eventually, thump, it landed between Barbara's and my respective areas. Luckily for us, we had the support of the chair; that faculty member was by then the chair of the Italian department. We undertook a two-year effort with the Italian department, with librarians on Barbara's team, with metadata specialists, and with the technology folks, and we were able to renovate and restore this project. It's now available. So this is a success story, but not without its cost. In taking on this unfunded mandate, with no additional resources and no strategic alignment with what the university wants us to do, we diverted resources to solve all the problems, and that has an impact on us. So while it's a success, it was not without its cost. Maybe it will create opportunities in the future, but that's the reality. And that's just one project. There are probably 50 to 75 behind it that don't have the good fortune of a chair of the Italian department providing that support. So when I learned about this project from Barbara and Professor Connelly, my first reaction was perhaps typical of an administrator responsible for technology-related services and the infrastructure and staffing that support them. Is this a project that will remain faculty-driven? Or should we consider it a project that might one day find its home within the libraries or one of our partner organizations? Is this something that should be developed by a single institution, or should we develop it aggressively as a consortial effort? Given that this is currently a bespoke, custom programming effort, my secondary thoughts moved on to Blacklight, Hydra, and Fedora, and whether these platforms might be relevant to these efforts.
That sounds great from a library perspective, but I was also concerned about the balance between supporting a stable system and the need to innovate and experiment, to support the type of scholarship and experimentation that Professor Connelly is doing to advance the aims of his scholarship and of this overall project. So I looked at three challenges. One, scale: how much data is stored today? How is it expected to grow over time? How complex is this information? What type of metadata is used to manage it? Two, the nuts and bolts: from a software design perspective, how is it designed? Are there approaches that might help this project as it grows from a project to a sustainable service? Can we leverage any of the existing open source efforts that we have? How can we reduce the upfront costs that we might have to put in place now? How might we reduce our local development efforts now and in the future? And how can we best help Professor Connelly figure this out? Three, from a sustainability standpoint: what will it take to run this system over time? What financial commitment is required to run it and to provide resources for innovation? If the project evolves into a consortial effort, what type of governance would it need? Going back to that slide: when I put these up, this is not prescriptive of what we should do for the Freedom of Information Archive, but we see here that there are consortial efforts around service development, consortial efforts around content development and management, and consortial efforts that really are about relationships. Any one of those three, or all three, could be formative for Professor Connelly's efforts.
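The first of those challenges, scale, lends itself to simple back-of-envelope arithmetic. As a rough sketch only, using figures that come up later in this session (roughly 26 million State Department records totalling under 100 gigabytes, growing by about half a million records a year; the ten-year horizon is an illustrative assumption, not a plan):

```python
# Back-of-envelope sizing for a mostly textual archive.
# Figures below are taken from elsewhere in this session; the
# ten-year projection is purely illustrative.
TOTAL_RECORDS = 26_000_000
TOTAL_BYTES = 100 * 1024**3        # upper bound cited: under 100 GiB
NEW_RECORDS_PER_YEAR = 500_000

# Average size of one record, in bytes.
avg_record_bytes = TOTAL_BYTES / TOTAL_RECORDS
print(f"average record size: {avg_record_bytes / 1024:.1f} KiB")

# Projected storage after a decade at the current acquisition rate.
ten_year_bytes = TOTAL_BYTES + 10 * NEW_RECORDS_PER_YEAR * avg_record_bytes
print(f"ten-year projection: {ten_year_bytes / 1024**3:.1f} GiB")
```

The point of the exercise is Rob's: for textual records, raw storage is not the hard problem; the metadata complexity and the staffing are.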
So what I did to support Barbara and Professor Connelly ("begrudgingly," I joked) was to assign some of my team members to work with Professor Connelly's development staff, to help us better understand the technical underpinnings, the data storage models, the software design, and maybe even some of the data structures. We used this information as input for the next stage of the process.

Let me just say first of all that I'm one of your clueless faculty who just expects the books to be there and the search engines to work. Before I started this project, I had no idea of the kind of effort that goes into that, and the kind of tough choices that Rob and Nancy and Barbara face every day when people like me come to them and want even more. So now I'm going to speak to what I've learned along the way, the ways in which I've tried to address some of those concerns, with a lot of help, as you'll see, and also why I still think there are real opportunities here. Not just for Columbia: I hope you'll see ways in which other institutions might partner in this space and begin working together. Because I really do think the issues of government accountability, and how we preserve the historical record, especially the record of our own time, are going to become even more important in the future, and ever more difficult, alas; all the more reason we should work together on them. So first of all, I often say, and it's true, that we are building the world's largest database of declassified documents, and what we're trying to do is not being done elsewhere. There's wonderful work being done at the National Security Archive, for instance, and DocumentCloud has enormous numbers of declassified documents as well, but what we're trying to do is build an archive made for the age of electronic records, and build it that way from the ground up.
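In practice, an archive "built from the ground up" for electronic records means normalizing very different collections into one consistent record structure that can then be queried across several metadata fields at once. Here is a minimal Python sketch of what such a normalized record and a combined metadata query might look like; the field names, values, and the `fielded_search` helper are illustrative assumptions for this talk, not the project's actual schema (the real collections carry far richer metadata):

```python
from dataclasses import dataclass, field

# Hypothetical normalized record; field names are illustrative only.
@dataclass
class DeclassifiedDoc:
    doc_id: str
    collection: str            # e.g. "FRUS", "Central Foreign Policy Files"
    date: str                  # ISO 8601 date string
    classification: str        # e.g. "SECRET", "UNCLASSIFIED"
    countries: list[str] = field(default_factory=list)
    persons: list[str] = field(default_factory=list)
    body: str = ""

def fielded_search(docs, *, classification=None, country=None):
    """Combine metadata facets in one query, instead of one field at a time."""
    hits = []
    for d in docs:
        if classification and d.classification != classification:
            continue
        if country and country not in d.countries:
            continue
        hits.append(d)
    return hits

docs = [
    DeclassifiedDoc("1", "FRUS", "1973-05-01", "SECRET", ["Egypt", "Israel"]),
    DeclassifiedDoc("2", "FRUS", "1974-02-11", "UNCLASSIFIED", ["Portugal"]),
]
print([d.doc_id for d in fielded_search(docs, classification="SECRET", country="Egypt")])
```

The design choice this illustrates is the one Matt returns to below: once records share a schema, a researcher can filter on classification level and country in a single pass, rather than running one keyword query at a time through a search box.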
And there are tremendous numbers of not just digitized but born-digital declassified documents becoming available now. Every agency has to maintain a FOIA reading room, and those electronic reading rooms represent an enormous government commitment: the government spends about $500 million every year complying with FOIA, in part by creating these electronic reading rooms to make those kinds of documents available to all of us. So what we have so far is a piece of it, but it's sizable and growing. We have the Foreign Relations of the United States series from the State Department, which also includes the records of every other agency and department involved in U.S. foreign relations, like Treasury, the Department of Defense, the CIA, and so on; the State Department's own Central Foreign Policy Files; and a few other collections you'll see here as well. And it's growing even as we speak. We're now in the process of acquiring another half a million records from the Central Foreign Policy Files. The Foreign Relations of the United States is produced every year: by act of Congress, the State Department is obliged to create the official record of American foreign relations. They still call them volumes, and they still print them, but more and more these are volumes being uploaded to GitHub, and there are now 70 more in preparation, which amounts to about 30,000 original documents. The Central Foreign Policy Files, as I mentioned, grow by another half a million records each year. These are born-digital records; the State Department began keeping electronic records in 1973, and when they first did an appraisal in 2006, they found that some 26 million records had accumulated by that point. But one thing to keep in mind, as I point out here, is that these are textual records for the most part; all of the Central Foreign Policy Files are textual records, so all of the data together amounts to less than 100 gigabytes. It's not actually a huge amount of data. Unfortunately, the National Archives continues to throw out a lot of this data; they simply don't have the resources to process it and provide it to researchers in the way they would like. But what we can preserve is a lot, and what we are able to preserve then helps us figure out what it is we're missing, as I'll show you in just a moment. So this is something I began working on because I was passionate about it. I was working with a statistics professor named David Madigan, because a lot of the data science work we wanted to do I couldn't do on my own. We worked with a lot of students and professors, but especially students in computer science, and over time we've been able to create a structured database with consistent metadata across these different collections, so as to create a resource that will be of interest to multiple disciplines. The National Science Foundation doesn't actually fund history, only the history of science, so when we put in an application we had to show how this is potentially a resource for sociologists, for political scientists, and so on. Take just one of the collections I showed you, the Foreign Relations of the United States: if you look at JSTOR, there are more than 2,000 articles in the political science literature referencing Foreign Relations of the United States. So this is not just a repository for historians; it's one that could be of interest to other disciplines, and as I'll show you in a moment, that's not just in the social sciences and the humanities, but also data scientists themselves, who are interested in this kind of structured data, want to work with it, and are beginning to work with it now. You may ask yourself, especially if one of us eventually comes to you asking for your help, why should we fund something when a lot of this is already online? By definition, what we're getting are things that are
either digitized or born digital. Well, I could ask you: why is it that we don't just dump all the books out in a big pile? Why do we create catalogs? Why do we create means of understanding the nature and content of these collections, so that we can not just find the one thing we were looking for, because we already had the title, but find the book to the right and to the left, and create the kind of serendipity we normally have in a research experience? That's the experience that's lacking when the only means of entry is a search engine. And unfortunately, more and more, with these collections all you get is the search engine, and the search engine is just not good enough for the kind of research that faculty want to do. More and more of us, even the historians, want direct access to the data, instead of having to go through the tiny keyhole you get with a search engine. And it's been shown by research in the e-discovery field, where lawyers have to deal with really large corpora, that when people use keyword searching to try to find responsive documents for litigation, they only find about 20% of the relevant records, because the fact is we don't know what the keywords are, and in many cases the things we want are the ones that don't have the one keyword we were looking for. Now, the collections I was talking about: this, for instance, is the search interface for the Central Foreign Policy Files. What you have is a single search box. You can do boolean searching, things like that. But if you want to do something that takes advantage of the tremendously rich metadata in these collections, the Central Foreign Policy Files have 68 different fields of metadata, like the classification level, what countries were concerned, what the subjects of the records were, even who reviewed them for declassification, the whole review and declassification history. If you want to do any kind of
searching involving that incredibly rich metadata, you have to search 28 different times, because the only way you can do that kind of fielded search is through an interface that only allows you to search one piece of it at a time. It's the same story over at the State Department with the Foreign Relations of the United States collection. This is a tremendous collection, now about 500 volumes, more than 200,000 documents, going all the way back to the Civil War, and the only way you can get access to that incredible collection is through the search engine. It's as if you walked into the Library of Alexandria and said, I want all the scrolls that have the word Cleopatra in them, but not Antony. I mean, who would do that? It's kind of a crazy way to do research. So I think what more and more of us want is a way of reproducing the experience of doing work in a proper library or a proper archive, with finding aids, and with some sense, when you find something, of the context and significance of what you've found. That's what we're trying to do with this platform. And what I would argue is that yes, it's true, a lot of this stuff is online, but when you go look at any one of these collections and all you have is that tiny keyhole, and you have to use it 28 times just to search through one collection, it's really difficult to find what you're looking for, or to know what it is that you've found when you find it. So what we're doing with our interface is allowing you to search across, so far, these five different collections, with more on the way, allowing you to begin with, say, just the secret stuff, if that's what you want. And what we mean by entity search is that we're leveraging research in data science called named entity recognition, where you can search through a corpus and extract the names of people, places, and organizations. You can use tools for this, not just off-the-shelf tools; unfortunately, you have to adapt them for historical corpora. But when you do that, you begin to extract all the
names of countries in these corpora, in this case, as I always mention, in the Foreign Relations of the United States. You can extract the names of people. You can begin to see what's in that collection, which is what almost everyone wants to know when they're confronting something of that size: what's in there? What am I going to find if I start looking? And so you can begin combining these entities, whether the classification level, the names of countries, or the names of people, and yes, you can do full-text searching as well. Here is an example of a technique from data science called topic modeling. Topic modeling is a probabilistic way of clustering documents in which the same kinds of words appear in the same context. In this case, when you run the algorithm, it produces topics, and the most common topic is one with the words Israeli, Palestinian, oil, Egypt, PLO, et cetera. When you read those documents, and it will rank-order them, you can tell these are documents specifically about Arab-Israeli relations. People use the same kind of technique to characterize a corpus like all the articles published in Science or Nature: they use it to identify, within that corpus, which articles are about genetics and which are about data science, and they can use it to see when a topic like genomics begins to emerge in the life sciences, and then trace its rise and fall. So we can do the same thing with other topics, to add structure to what is otherwise an unstructured collection, and again to reproduce what all of us have come to hope and expect to find when we go to an archive: to begin with, a finding aid, some sense of what's inside. Now, another tool we have is one we borrowed, with their permission, from the people who created the Google Books Ngram Viewer, and we're able to use it now, and you
can search different archives. You can find, for instance, in this case, the rise and fall of human rights. There are historians who do this kind of work manually. There's a man named Samuel Moyn who wrote a very well known, controversial book called The Last Utopia about the history of human rights, who argued that, as much as people think the idea of human rights came from the UN Charter and the Holocaust, it actually took hold only in the late 60s and early 70s. He had to manually look up, in several different newspapers, the number of times somebody used the term human rights. Historians will do that kind of work because they know that, for the history of ideas, sometimes it's helpful just to know where those ideas come from and how they spread. Now you can do that at scale. You can do it with enormous collections, and you can begin to do multi-archival research by comparing whole archives one against another. All the tools on our website are there not just to produce pretty graphs and whatnot; you want to know the underlying nature of the data itself. So it allows you to find the documents that produced the data. In this case, once you begin to click on the ones that produced that little burst in March of 1977, it not only retrieves the first document that helped produce that data point, it also lets you see what other topics are represented within that document, and you can pivot from there to explore other related topics: topics related to the history of human rights, in this case Portugal's former colonies, relations with Ethiopia, et cetera. And also, on the lower right, there's a tool called Merriam, which is one we're using in collaboration with an e-discovery firm. It automatically generates similar documents, and you can calibrate or adjust the similarity, like an old graphic equalizer: the similarity in language, or the similarity in countries, or the similarity in
people, or the similarity in time, so you can find what was passing across the desk the same day as the particular document you were looking at. So we're trying to reproduce the experience of archival research the way it was meant to be, as best we can, with the tools we have from data science. The last thing I want to do is address Rob's concern about sustainability. Normally we just think about sustainability in terms of money, like, where's the money going to come from? But what I would like to suggest is that sustainability ultimately, whether it's money or talent, the talented people you have to recruit to work on this stuff, comes from the intrinsic importance of what you're working on. Isn't the material itself important enough? Well, I've been trying to show, whether it's through acts of Congress or through the other ways in which the government, or the people as a whole, identify the things we have to care about and preserve, that the archive of America and the world is something we all have to care about, now as much as ever, and not just the historians. So the data we're collecting is now being used by computer scientists, whether at Princeton or, in this case, at Microsoft, in creating a tool to automatically detect historical events. It's also something that a professor at MIT working in statistics found useful for developing new models to find burstiness, to do traffic analysis, the same kind of research the NSA is doing, but now on American declassified documents. This is also of interest to the general public: BuzzFeed will write a feature about research like this when they think it's something many people will find interesting. And then finally, and most importantly, I think this is the kind of research, the kind of collection, that's going to begin to help us understand what it is we're missing, what it is
that's not there, what's not being preserved because it's being destroyed, and what's not being shown to us because it's still classified, still being kept secret. So with this kind of data and these kinds of tools, you can begin to make out the dark matter in this universe. You can begin to see what the government is classifying, and where it's more likely that you're never going to see a record. This is some research I wrote about recently with one of my students for the Washington Post, applying these kinds of algorithms to the problem of machine classification of state secrets, and it's a case that shows how much human error there is in the way we identify things as secret or not secret. However you count it, and this is from government statistics, the Information Security Oversight Office produces statistics every year on how many secrets the government is generating: enormous numbers of not just documents anymore, of course, but more and more electronic records. That's where you see the inflection point, when they begin to count things like email and text messages and so on. You see enormous amounts of classified data, and it's growing all the time, but the number of documents being reviewed and released to the public is declining. In fact, since the late 1990s there's been a collapse, basically, and even in 2015 there are only about 30 million pages of documents being released every year. And here we do come back to money, because it's really about the money. The nation's budget for keeping official secrets is over $15 billion. This graph only goes up to fiscal 2011; we're now at $15 billion that the government is spending on keeping secrets, the things they think are too dangerous for the rest of us to look at. Take a look at that tiny sliver at the bottom. There was a little blip in the late 1990s. Since then, adjusted for inflation, the amount of money the
government is spending on declassification is 15% of what it was in the late 1990s. It is less than 1% of what the government spends on guarding state secrets. So that's the government's priority. That should not be the way we decide what our resources should be paying for, what our priorities ought to be, especially in this time. So these are the arguments I would make about why this is something people should care about, and I'm hoping we're going to find partners, maybe even partners in this room. So I'll turn things over to Nancy.

This slide shows... oh yeah, sorry, I'll tell you really quickly, since I don't have much time before questions: I'm actually not with Columbia. I'm the hired gun, the consultant that Columbia has invited in; my firm is called BlueSky to BluePrint. And I love projects like this, because the trick, as Matt mentioned, is that it's a financial question at the end of the day, because some folks will have to get paid to do some work, but there are a lot of other dynamics involved as well, in terms of making this something that a lot of people buy into, use, and participate in. In September and October I interviewed some stakeholders at Columbia, deans of libraries around the country, just a handful, and faculty, mostly in history and some social sciences. They took a glance at the project, just as it had been launched, they read a two-pager, and then they gave some first impressions about what it could be and where it could go. So obviously everyone loved it, but it was a little bit like the blind men describing the elephant. Someone said, oh, this would be great if it had a huge amount of content. Someone said, oh, this would be great if I could download the data sets; I never want to use the website. And it went on like that. People had a tremendous variety of responses, not so helpful, but important to know. Important to know that part
of our task is shaping that. There was a big disciplinary divide in how people would use it. As Matt suggested, between historians and social scientists, or those in the humanities versus the social sciences: does your question involve needing to read a document, or does your question involve needing to churn through data looking for patterns? The split was very stark. Different people doing it different ways wanted the information in different ways. Historians doing very sophisticated work didn't necessarily want to build their own analysis tools; they very much wanted to use it on a site, they just wanted to get to the material. Folks more comfortable with computation said, I'm sure your tools are nice; I'm going to write my own. And that split was very clear. The other big message, and there's a little tension here: you have to go big. This has to be huge. The numbers look big, but what part of the whole does that represent, and how quickly could you get to the whole, and how can you talk about what that whole is? But don't try to do everything. So how do we figure out what that's going to look like? And then finally, everyone was very impressed by the work, especially since it's been done by a skeleton crew. But here's the point: how do you take something that's identified with one person at one university and have the entire community, however we choose to define that, buy in, so that everyone feels real ownership of this? So we can go to the next slide. I'll just nuance what some of these issues are, and then we'll open it up. That content roadmap is tricky. There were 25 agencies creating these materials; they get classified and then declassified, and there are different timeframes and different processes for the declassification, and then maybe some stuff is just more interesting than other stuff. So how do we figure out what that's going to look like and how quickly to get there? There were a lot of different ideas on this; actually, each different scholar I spoke
with has their own personal project. An interesting idea I floated out there was that there might come to be some kind of community of people who help to influence not just what gets put here, but possibly what the government declassifies. All of a sudden, that started to sound like an area we might look into some more. Overall, this next phase has to involve some degree of really sharpening how we talk about this. Matt does such a great job of talking about the richness of it, but again, how can we talk about what its future looks like? Are we going to aim for real innovation in the research tools, or are we going to aim for real growth along a certain path on the content? What's the investment going to be? If people do come in, how do we encourage them to support the strategy that the team will devise? And then finally, how do you actually get people involved? That's the age-old question. There seems to be an inherent interest and fascination, but this is going to be a real work of outreach. Which elements will people respond to? How broad is that disciplinary uptake going to be? How do you explain to a scholar of English that there's probably stuff in there they're interested in too, depending on who they're studying? So there's going to be a lot of work around that. And finally, the funding model: at the end of the day, it's going to cost something to do this, no matter how clever the cost-avoidance strategies are. There's going to be something that will have to be covered, as these folks know very, very well. I will also just note, because it made me chuckle: with literally everyone I spoke with, I would ask, so what do you think feels like the right kind of model for something like this? I heard everything from advertising and corporate sponsorship to more traditional things like partnerships and building a consortium. I've never had a project where literally every single box was checked, and where people also felt each model couldn't possibly work. So we've got a lot of work to do. At this point, I'll open it up to you. We would love to hear which pieces of this really resonate with you. Also, if you want to check it out, I don't know if the connection here is any good, but if you have a decent internet connection, it's at history-lab.org.