Alright, thanks everybody for coming to our session today. We're going to be talking about the Data Capsule Appliance for Research Analysis of Restricted and Sensitive Data in Academic Libraries grant that we have from the Institute of Museum and Library Services. My name is Robert McDonald; I'm from Indiana University. Giving a brief intro of the data capsule appliance after me will be Inna Kouper, who is also from Indiana University and head of our Data to Insight Center. After Inna will be Erik Mitchell from the University of California, Berkeley, and we'll wrap up with John Unsworth from the University of Virginia. What we'll be talking about today is how we've been repurposing our data capsule into more of an appliance for use by archives and libraries with born-digital content. I'll give you the brief overview here. As you'll see from this slide, these are the three areas the HathiTrust Research Center works in now with the HathiTrust content. We have our extracted feature sets, which have been a big hit with a lot of DH users because they can take those feature sets with them and do their own analysis. We have our HTRC portal, which has some built-in analysis tools you can use against the collection of the HathiTrust Digital Library. And then we have our data capsule, here to your far right.
A couple of years ago we had a pre-CNI seminar about this type of analysis of born-digital content from libraries and archives, and we were happy to have a lot of different folks around the table using different technologies to build different kinds of data enclaves: some from the health sciences, some, like HathiTrust, built more around textual content, and some working with other types of material like video. Since that time we've been looking at how the data capsule, now that we've got it into production with the HathiTrust Research Center, could be used by libraries and archives to create their own data enclaves for restricted born-digital materials, so that their curators and archivists could work with those materials directly but in a secure environment.

If you've used our data capsule you probably understand a little about how it works, but this diagram runs through what happens when it's in place for the HathiTrust Research Center. You end up with a snapshot of a VM that is set up by the researcher. Once they have the virtual machine configured with the tools they want to use for their analysis, they switch it into secure mode, and that's when they get access to the copyrighted text and the volumes become available to them. Once they run their research, all of their analysis outputs have to be vetted before they're released back to them. That gives you an idea, walking through the switch between maintenance mode and secure mode, of how access to the HathiTrust data works. So the idea here is to take the data capsule and put it into place at several libraries that are looking at using it with born-digital collections, so that they can allow their researchers and curators to access those collections in a secure environment.
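The two-mode workflow just described can be sketched as a simple state machine. This is a hypothetical model, not the actual HTRC implementation; the class, method names, and vetting policy are all illustrative:

```python
# Hypothetical sketch of the data capsule's two-mode workflow.
# Not the actual HTRC code; names and policies are illustrative.

class DataCapsule:
    def __init__(self):
        self.mode = "maintenance"   # researchers start by configuring tools
        self.released_results = []

    def switch_to_secure(self):
        # Secure mode: restricted corpus mounted, outbound network blocked.
        self.mode = "secure"

    def switch_to_maintenance(self):
        # Maintenance mode: network open for installing tools, no corpus access.
        self.mode = "maintenance"

    @property
    def corpus_accessible(self):
        return self.mode == "secure"

    @property
    def network_open(self):
        return self.mode == "maintenance"

    def request_release(self, result, reviewer_approves):
        # Analysis outputs leave the capsule only after human vetting.
        if self.mode != "secure":
            raise RuntimeError("results are produced in secure mode")
        if reviewer_approves(result):
            self.released_results.append(result)
            return True
        return False


capsule = DataCapsule()
assert capsule.network_open and not capsule.corpus_accessible

capsule.switch_to_secure()          # now the restricted volumes are visible
assert capsule.corpus_accessible and not capsule.network_open

# Only vetted, non-consumptive outputs (e.g. aggregate counts) get out.
ok = capsule.request_release({"token_count": 1234},
                             reviewer_approves=lambda r: "raw_text" not in r)
```

The point of the model is the invariant: the corpus and the open network are never available at the same time, and nothing leaves except through the review step.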
The other big piece I wanted to mention is the containerization of the data capsule in our recent 4.0 release. That's what's helping us get there with this project, in terms of being able to install it in additional environments like those at UC Berkeley and UVA. With that I'll turn it over to Inna, who will talk a little more about the goals of the project and the project deliverables.

So as Robert already mentioned, the assumption we're building on is that beyond the HathiTrust Digital Library there are a lot of different collections that libraries are working with. These digital collections have rich possibilities for computational analysis, but some of them come with different kinds of restrictions. It could be copyright, similar to HathiTrust, but it could be other types of sensitivities: confidential information, private information, personal information, those kinds of things. So we're extending the data capsule, an existing technology that has been proven to work well with the HathiTrust Digital Library, beyond HathiTrust, and trying to understand the needs of libraries that work computationally with restricted collections. That's one of our goals. Then, along with extending the service and the technology to enable access to restricted data, we're also trying to understand the skills needed to provide such a service, and where the gaps are in that skill set in libraries.
The way we're doing that, and I think this is one of the interesting parts of the project, is that we're working with several partners in libraries around a framework of participatory design. It's different from developers coming in, collecting requirements, building a tool, and then seeing whether it's useful or not. We start the partnership right from the beginning and work together in a library-technology collaboration where, from the outset, we discuss what it means to provide restricted access to collections, what it means to provide computational access to restricted collections, and then what it means to extend the data capsule. So we have partners at different levels, which we'll list at the end, with different modes of engagement, working through the first steps of data capsule installation and then understanding the difficulties. That's the teamwork we're doing at Indiana University along with partners at libraries. The second main piece is the software architecture: changing and modifying the data capsule itself. Because of this dual nature of participatory design and software architecture, the deliverables are, on one hand, knowledge of restricted collections, their policies, and their contexts of use, and on the other, packaging the data capsule as an appliance and making it useful for different collection types. Meanwhile we're trying to understand the complex library-technology research collaborations that happen around projects like this, and we have some community-building exercises built into the project, so we hope one of the deliverables will be an emerging sense of community coming out of this work.

Great, so our use case involved video, and I picked video because it's not text, and the data capsule has been used exclusively with text so far.
I picked video because we actually have a large collection of video from a Virginia television station, WSLS, which is all in copyright. I picked that collection because the rights holders are actually happy to have us share it, so the risk is low, but video is a great example of content that will almost certainly be under copyright and have restrictions on its use. So our particular use case is low risk, but we're dealing with a kind of content that is, generally speaking, both high risk in terms of legal restrictions and computationally challenging: if you want to process video, it takes large amounts of computational power to do complicated things with large amounts of it. What we're doing with the video is running different kinds of captioning software on it. The slide says "new urgency now that Apple has bought the Pop Up Archive captioning service." I don't know how closely you all follow this, but Pop Up Archive was what we were using to do our captioning, and they shut their doors on Thanksgiving, with very little warning to their user community, after being bought by Apple. That's concerning, but there are other packages out there we can use, and we're now trying to see which ones work better for which parts of the problem. We are particularly interested in trying to create a push-button installation package that you can hand to a library and say, okay, run this and you will end up with a data capsule.
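With containerization, a push-button installer could amount to little more than turning a short site-specific config into a couple of container commands. A minimal sketch of that idea; the image name, mounts, and flags here are hypothetical, not the actual Data Capsule 4.0 packaging:

```python
# Hypothetical sketch of a "push-button" capsule installer that turns a
# short site config into container launch commands. Image names, mounts,
# and flags are illustrative, not the real Data Capsule 4.0 packaging.
import shlex


def build_commands(site_config):
    """Return the docker commands a site would run to stand up a capsule."""
    data_dir = site_config["restricted_data_dir"]
    name = site_config["capsule_name"]
    return [
        # Pull the (hypothetical) prebuilt capsule image.
        "docker pull example.org/data-capsule:4.0",
        # Run it with the restricted collection mounted read-only and,
        # for secure mode, no outbound network.
        "docker run -d --name {} --network none "
        "-v {}:/data:ro example.org/data-capsule:4.0".format(
            name, shlex.quote(data_dir)),
    ]


cmds = build_commands({
    "capsule_name": "wsls-capsule",
    "restricted_data_dir": "/archives/wsls-video",
})
for c in cmds:
    print(c)
```

The library's only input is the config dict; everything environment-specific that is currently hardwired would have to move into parameters like these.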
There are definitely people at Berkeley who are working on parts of that problem, and people at Indiana, and people at Virginia, so that's something we all have our hands in. Not surprisingly, in the course of trying to do that, we're finding all the ways the data capsule has been hardwired to the HTRC environment, to the data structures it expects, the file storage structures it expects, and so on. That's great; that's what we're supposed to find in this project, and we're finding it. So we're trying to disentangle the data capsule from its native environment, make it easier to export and share in other environments, and look at how it might work in other kinds of library projects.

Some examples from UVA give you context for why we're interested in this problem. We've been working for about five years now to try to enable computational access to digital collections at the Packard Campus of the Library of Congress, so audio and video: basically opening a research terminal or terminals at Virginia that give researchers computational access into that collection. I will say the challenges have been more bureaucratic than technical in that process, but gradually, to my fatigued relief, we have made progress, and I believe one day this will actually happen. We've also done some work with Ithaka, and particularly with Kate Wittenberg at Portico, to survey the landscape for text data mining and try to understand the perceived need for it among librarians, scholars, and publishers. That's been some interesting work, and I connect it to this effort because, in my mind at least, understanding the emerging market for text data mining in libraries is part of what we might do with the data capsule work. If we had a data capsule that was easy to hand to somebody and easy for them to set up, it would change the kinds of conversations and considerations you'd have around this.
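One way mining could work across restricted collections without co-locating them is for each capsule to release only vetted, aggregate results, which are then unified afterward. A toy sketch of that merge step; this is purely illustrative and not an existing data capsule feature, and the collection names are just examples:

```python
# Toy sketch of unifying non-consumptive results from capsules pointed at
# different restricted collections (e.g. term counts released after vetting).
# Purely illustrative; this is not an existing data capsule feature.
from collections import Counter


def unify(per_capsule_counts):
    """Merge vetted term-count results released by each capsule."""
    total = Counter()
    for counts in per_capsule_counts:
        total.update(counts)
    return total


# Aggregate outputs that each enclave's review process allowed out:
collection_a = Counter({"copyright": 12, "mining": 7})
collection_b = Counter({"copyright": 30, "license": 4})

merged = unify([collection_a, collection_b])
print(merged["copyright"])   # → 42, combined across both collections
```

The raw text never crosses an institutional boundary; only the already-vetted aggregates do, which is what makes this legally simpler than co-locating the material.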
Right now, I'll just say, libraries generally see themselves as an intermediary between faculty and publishers when it comes to procuring data sets, but what they mostly do at this point when they get them is hand them to faculty and say good luck. Libraries don't provide a lot of infrastructure for this, and they don't really have a plan for it in most places. So I think if the data capsule were part of that conversation, it might change what the library sees as its on-the-ground role in facilitating research with remote collections held by publishers. And finally, I've been working for a year or a little more with Portico and JSTOR to see if we can't figure out how to do text mining across distributed collections of copyrighted material; you might have heard a previous presentation or conversation about this. Originally we were thinking about co-locating the material, but that raises so many issues from a legal and contractual point of view that it will probably never happen. So having a data capsule that you can point at different restricted collections, and maybe unify your results at the end in some meaningful way, would be an important way out. So I have lots of reasons for being interested in this project.

On the UC Berkeley side: actually, as you were talking I was picking up on maybe an educational use case we hadn't thought of, and maybe I'll get back to that in a moment. But on the UC Berkeley side we've talked a lot about how to provide compute environments for people doing data-focused research, particularly in conjunction with the Moore-Sloan Data Science Environments program, as the data science institute is in the library. As part of that we've hired a postdoc fellow who's been studying software curation, and earlier this year, early in 2017 I guess, she worked with John Borghi (Yasmin AlNoamany is our postdoc) to study the software curation and data sharing practices of data scientists broadly. One of their somewhat surprising findings was that, sure, there are cases where people can't share data, but it wasn't for lack of community interest; they actually ran into sensitive data issues. I think this project is interesting to us because it at least gets us one step further down the road of enabling reproducibility on sensitive data. I'm not sure how you get all the way down that road, where you put the data capsule out in the wild and then somehow federate access to it, but the notion of being able to provide a secure compute environment that matches the restrictions data might come with is certainly really interesting. I'm sure these slides are available; they discuss the study in much more depth than I'm going to here, so have a look. Actually, Alex Chassanoff and, I forget Thornton's first name, the postdocs studied this somewhat broadly, so that's a really compelling presentation to read in conjunction with this. Some of my own research, in partnership with Heather Moulaison Sandy and Edward Corrado, studied data sharing practices within the library and information science field, and I mention it only to say that we found the same issues: data privacy wound up being a key barrier. Although, as we presented our findings at ASIS&T, we got into a really interesting discussion about whether human subjects data is even reusable given IRB restrictions, which exploded the topic a little more. But there was one response here, the one under the first green line, "desire to preserve ownership rights," that I thought was a really interesting edge use case for something like this. I'm not sure that, as a librarian, respecting that desire would be the first thing on my to-do list, but it was interesting: the researcher who responded here talked a lot about the work they had put into gathering that data and their interest in holding on to some sort of rights to it. This environment would be a way, technically, for them to enable others to access their data while maintaining those rights and not letting others entirely use it.

So, John, to the education use case: it strikes me that at Berkeley we have a growing undergraduate data science education program, and in the libraries we've talked a lot with the people teaching courses there about how to use licensed data sets in their curriculum. I haven't thought this through a whole lot, but this data capsule appliance is actually a kind of risk mitigation tool you might use if, say, you're going to try to make licensed secure data available to 15,000 undergraduates, where the one-to-one researcher trust we rely on when we license data isn't a realistic expectation. So where we hope to get, I think on the next slide, is that this data capsule environment becomes a new secure data service we offer campus-wide. Across the campus we've got a great high-performance computing environment called Savio that our Research IT group offers; we've got streaming desktops, basically the open version of the data capsule, through a service called Analytics Environments on Demand, as well as streaming applications through Citrix. We actually use these in a few environments to provide access to restricted data where the researcher can have full control. There are other examples across campus where we've built cold rooms, where we maintain computers of different sizes with different sorts of access restrictions and the researchers don't have full control over the data. The library doesn't maintain any of those, but I know from talking to our colleagues that they require a lot of staffing, a lot of
just-in-case equipment purchasing, and a lot of policies. In fact, one of the most recent problems one of these groups ran into is that campus IT was trying to provide support for the desktop computers in them and kept violating all the configuration policies on the machines, because they weren't well versed in what a cold room was. So in an ideal case this becomes a push-button deployment, it works seamlessly on all of our high-performance infrastructure at Berkeley, and we get to a point where researchers can actually make use of these data capsules in a pretty seamless way.

Up next are some of the things we're finding with the trials at UVA and Berkeley. A lot of it, as Cliff hinted at a little in his opening today, is around the use of these containerized components and how you roll them out in standardized environments so that they're fast to put out there and easy to use. But then for us, and I don't know if members of the panel want to dig deeper, there's this whole issue of being able to mine across different data sets. The policy issue is one of the bigger and thornier issues, because IU's own use case for this will be political papers that are born digital, and of course people are going to want to mine across different segments of those, usually by how they're set up in the archival structure.

Well, one of the bullets I contributed to that slide had to do with security requirements. Particularly given the purpose of the data capsule, which is to provide secure computation with sensitive data of some kind, that seems like the part of the data capsule that's going to need to evolve, and continue to evolve, the fastest. This is an area where I think containerization actually buys us a lot, because we don't have to implement those fixes on eight different operating systems; we can do it once and put it out there.

Robert and Inna, you could probably speak to how you'd technically solve the non-co-located data issue. I was in the session just ahead of this one, where they were talking about a national data repository archive for Canada, and I was wondering how you would attach this sort of environment to that, where they would sink their 30-terabyte data set into a research platform. I don't know if you've got any words of wisdom about how you'd actually, technically, solve that problem.

We've barely started a discussion on this, and I think we don't know yet; that's why it's research.

That's right. Well, a lot of people are moving toward the use of Globus, and that's what the last panel talked a little about; that does help when you can move your data around. In the HTRC implementation we're not moving that data around: there's the one copy that's used for the mining, and then there's compute in place near the data, so you don't have to move it around. That's a key factor. But you'd have to think about what you would use for that kind of analysis in your own environment, what kind of compute power you would use yourself if you were to build one of these. Another issue that's come up, in the Big Ten anyway, with the new Clarivate contract: folks will be able to get most of that citation data to be mined, but right now there are very few enclaves set up to actually do anything with it. One of those is at a research center at IU, and they've been running it for a while because there's a lot of research interest, but there's a whole lot of data cleaning that goes into making that package usable for that kind of ongoing work. We had a few other discussion topics to throw out to the audience, to see if there are other use cases you're facing now, and other kinds of licensing components that are coming up in your own work that might be of interest to this group.