This is Joan Lippincott, Associate Executive Director Emerita of CNI, and I'm delighted to have you joining us for a CNI spring membership meeting webinar on sensitive and protected data in distributed digital preservation networks — an IMLS planning grant briefing. This is a really important topic for many libraries and for many universities, and we're delighted to have this collaborative project represented in our lineup of sessions. The project includes speakers from the partner organizations: Courtney Mumma and Kristi Park of the Texas Digital Library and Sibyl Schaefer of the UC San Diego Library. Today all of the participants' microphones will be muted, but as the presenters go through the presentation, please feel free to write your questions or comments into the Q&A. We can use the chat box as well, but we're primarily going to be looking to the Q&A, which you'll see as an icon on your screen, and you can type in a question at any time. I'll be monitoring that box, and we'll feed the presenters your questions at the end of the half-hour presentation. We'll put any additional information, such as URLs, into the chat; we will share the URL for the slides when they're available. This session is being recorded, and you'll have access to the recording afterward. So I think that should get us going. I'm going to hand the session over to the panelists and mute myself. Over to you.

Thank you, Joan, and thanks to everybody who is attending for taking time out to hear an update from us about sensitive and protected data in distributed digital preservation networks. This is Kristi Park, and I'm the Executive Director of the Texas Digital Library, one of the partners on this project. I'm joined by Sibyl Schaefer, who is the Digital Preservation Analyst for Research Data Curation at UC San Diego Library and the Chronopolis Program Manager.
And by Courtney Mumma, my colleague, who's the Deputy Director of the Texas Digital Library. TDL and the UC San Diego Library have partnered on this project, which was funded through an IMLS planning grant to explore a nationwide model for a DDP service that would close gaps in current preservation offerings for sensitive and protected data. Our briefing today will give an overview of that investigation to date, describing the problem we're trying to solve and our activities up to this point, including some use cases we've gathered, analysis of the legal agreements governing such a service, and the necessary elements we've identified for a DDP network for private and sensitive data. TDL and UCSD have both established business models for building and providing distributed digital preservation services, and have a history of collaborating with one another on services like these. The UCSD Library, of course, manages Chronopolis, an internationally recognized DDP service that spans three sites across the US; it's one of the earliest established distributed digital preservation services in the world, having been in operation now for more than a dozen years. TDL, a consortium of academic libraries providing a number of services supporting digital library work, has offered access to distributed digital preservation storage since 2015 using its own hosted instance of DuraCloud. Our history with UCSD goes back to work we did together on the DPN (Digital Preservation Network) project, but in 2017 TDL also joined the Chronopolis network in order to provide access to Chronopolis storage for our members and to become a replicating node for that network. We do that using storage at the Texas Advanced Computing Center. So we've been excited to be able to continue our collaboration across these two institutions and to advance distributed digital preservation services beyond the wind-down of DPN.
Beyond the main project partners and our funder, the Institute of Museum and Library Services, we've been grateful for the support and participation of a large number of institutions who've partnered with us on this project, and I'll tell you a little bit about them. One of the reasons we decided to work on this project in particular, beyond the identified need for such a service, is that both TDL and UCSD maintain close working relationships with organizations affiliated with their campuses that could provide key resources for a DDP service for private and sensitive data. The Texas Advanced Computing Center, or TACC, at UT Austin has partnered with TDL on a number of projects, and we currently use storage at TACC for our Chronopolis node. TACC independently offers secure HIPAA- and FERPA-compliant storage to local partners, and similarly the San Diego Supercomputer Center at UCSD provides HIPAA-compliant storage to faculty and researchers there. Both computing centers have shared with us their expertise in providing sensitive data solutions, including service and cost modeling. But we also, at the back of our minds, hope that any service we build would make full use of the resources available to us through these two institutions. Additionally, throughout the project we've benefited from consultation with a large number of representatives, including representatives from APTrust, the Smithsonian Institution, Northeastern University, UNT Health Science Center, UT Southwestern Medical Center and the Dell Medical School at UT Austin, DuraCloud and its home institution Lyrasis, and the Maryland Advanced Research Computing Center at Johns Hopkins. These partners have, among other things, helped us build use cases for a private and sensitive data distributed storage network and helped surface needs and challenges related to such a service.
We have also partnered with Security Metrics, an expert consulting firm in HIPAA compliance, to provide templates for policy creation and to review various legal agreements for us, as well as the final project deliverables. So, with that, I'm going to turn it over to Sibyl to start digging into the project itself.

Hi everyone. For those of you who are not familiar with the term distributed digital preservation, I'd like to take this opportunity to define it, since we'll be referring to it throughout the presentation. In 2010, the MetaArchive Cooperative published A Guide to Distributed Digital Preservation and offered this definition: distributed digital preservation methodologies hold that any responsible preservation system must distribute copies of digital files to geographically dispersed locations, and also must preserve, not merely back up, the files in these different locations. The guide also describes qualities the different sites should have. Sites preserving the same content shouldn't be within a 75-to-125-mile radius of one another. They should be distributed beyond the typical pathways of natural disasters such as hurricanes, earthquakes, typhoons, and tornadoes. They should be distributed across different power grids. And they should be under the control of different system administrators. This last one, I feel, is often overlooked, but for security purposes all the preservation copies should not be accessible by any one person or team of people, and control and monitoring of each preservation site should ideally be handled locally by that site, to ensure that the network's contents aren't subject to a single point of human-based failure. The content preserved in the disparate sites should be on live media and checked regularly for bit rot and other issues, and the content should be replicated at least three times.
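The siting criteria from the MetaArchive guide lend themselves to a simple mechanical check. Below is a minimal sketch, not part of the guide itself, of validating a proposed set of replicating sites against those rules; the `Site` fields, the 125-mile threshold interpretation, and all site names are illustrative assumptions.

```python
import math
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    lat: float        # latitude, degrees
    lon: float        # longitude, degrees
    power_grid: str   # regional power grid identifier
    admin_team: str   # responsible sysadmin group

def miles_between(a: Site, b: Site) -> float:
    """Great-circle distance between two sites (haversine formula)."""
    r = 3958.8  # Earth radius in miles
    p1, p2 = math.radians(a.lat), math.radians(b.lat)
    dp = p2 - p1
    dl = math.radians(b.lon - a.lon)
    h = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))

def validate_network(sites: list[Site], min_miles: float = 125.0) -> list[str]:
    """Return a list of violations of the DDP siting guidelines."""
    problems = []
    if len(sites) < 3:
        problems.append("fewer than three replicating sites")
    for i, a in enumerate(sites):
        for b in sites[i + 1:]:
            if miles_between(a, b) < min_miles:
                problems.append(f"{a.name} and {b.name} are too close together")
            if a.power_grid == b.power_grid:
                problems.append(f"{a.name} and {b.name} share a power grid")
            if a.admin_team == b.admin_team:
                problems.append(f"{a.name} and {b.name} share an admin team")
    return problems
```

A real network would of course also need the qualitative checks (disaster pathways, local control and monitoring) that don't reduce to arithmetic.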
Lastly, regardless of whatever technical infrastructure a DDP network adopts, the network is going to perform three main tasks: content ingest or harvest, content monitoring, and content retrieval, with each of these varying across different technical infrastructures. The problem we're trying to tackle with this grant is that although distributed digital preservation services have been offered in the United States for well over a decade, there isn't a distributed service offering for sensitive data. We propose that personally identifiable information (PII) and protected health information (PHI), as well as other sensitive data in the custody of libraries, academic health science centers, and archives, is at an escalated risk of loss because of this lack of service. It's also normal practice for archives to refuse any data that contains any PHI or PII, regardless of its historical or evidential value, simply because they don't have the means to steward it. As part of this work, the team is assuming that the bar set by HIPAA is sufficiently high to protect many other kinds of non-regulated sensitive data, and based on that bar, what we're really looking at is what it takes to provide a HIPAA-compliant DDP network. So one of the goals of this grant is to investigate the capacity and feasibility of a nationwide model for a DDP service that would close this gap for sensitive data. The technology infrastructure and expertise needed to build a DDP service for sensitive data exist; Kristi has already mentioned our partners and the expertise they have, so that part is there. But the connections, agreements, and processes to pull it all together into a viable service are lacking. So we're not necessarily trying to build the service yet, but rather to set the groundwork for it and plan out what we need.
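Of the three tasks just named, content monitoring is the most mechanical to illustrate: each node periodically re-hashes its holdings and compares them against a fixity manifest, reporting anything missing or corrupt so it can be repaired from a sibling node. Here is a minimal sketch under stated assumptions — the manifest format (relative path mapped to SHA-256 digest) and function names are hypothetical, not Chronopolis's or DuraCloud's actual interfaces.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large objects never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def audit(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare files under `root` against a {relative_path: digest} manifest.

    Returns the two signals a replicating node would report back:
    files that have disappeared and files whose checksum no longer matches.
    """
    report = {"missing": [], "corrupt": []}
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.exists():
            report["missing"].append(rel_path)
        elif sha256_of(target) != expected:
            report["corrupt"].append(rel_path)
    return report
```

In a sensitive-data network, a check like this would additionally have to run under each node's own administrative control, consistent with the "no single point of human-based failure" principle above.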
We're also interested in using the grant deliverables to really assess our capacity to meet the requirements we outline and to initiate discussions with possible network partners. We started the grant in September of last year. One of the first steps we took was to hire a graduate research assistant, and we started gathering data at that point, conducting interviews for use cases and outlining the agreements we currently have between institutions. We had a very successful in-person partner meeting in Austin in December, and we're now drafting our final report and working with Security Metrics, the HIPAA compliance experts Kristi mentioned, to determine how our current slate of agreements and technical infrastructure would need to change in order to be HIPAA compliant. Once we have solid drafts of the report and legal templates, we're planning to disseminate those to our project participants and have a second meeting, this time virtually, to gather feedback. We'll then finalize the report based on that feedback and eventually publish the report along with any agreement templates and cost modeling we've developed. As we've progressed through this grant, we've identified a couple of different potential outcomes. The first is the one we assumed we would end up with: a roadmap and recommendations for a DDP solution for private and sensitive data, ideally with new technological and community partnerships; the potential for further projects or grants is definitely something we've been discussing. An alternative outcome — though it could actually be reached in conjunction with the first — is the conclusion that a DDP for private and sensitive data is not feasible at this time. Some of the barriers we've come across in our research are lack of user readiness, cost, and technological complexity.
I would say that the lack of user readiness and cost are probably the more serious barriers compared to technological complexity. And with that, I'm going to turn it over to Courtney to talk about some of the use cases we have uncovered.

Thank you, Sibyl and Kristi. So, sorry, I lost my place; there we go. So far we have collected use cases about private and sensitive data content that should be preserved in a distributed network from several medical libraries, as well as archives and special collections. Those we've documented contain various sensitive materials, including human rights records, first-person accounts of trauma, commercially restricted data, and also email. In each of these use case discussions we typically ask whether institutions know they have sensitive data that justifies a high level of preservation. In some cases we think institutions suspect that they do, but haven't done the level of assessment or appraisal necessary to quantify the problem sufficiently, as Sibyl alluded to earlier. Alternately, in other cases, the assessment of sensitive content has been done and a decision has been explicitly made not to bring it into the custody of the library or archives. Use cases that have demonstrated this behavior give one or more of the following reasons: limited resources to manage it, unclear authority to manage it properly, and the lack of a place to put it. This illustration shows the legal bonds necessary to sustain the Chronopolis distributed digital preservation system. As you can see, there are many connections here which require legal agreements. The connections you see include software licenses, for example those in place for DuraCloud, Chronopolis, and AWS. There are also contracts, service level agreements, and MOUs between service providers and storage nodes, like those that TDL has with TACC and Chronopolis has with NCAR.
There are agreements between the service provider and depositor, like those UCSD and TDL have with their members and community depositors. There are also agreements between two service providers, like those that exist between Chronopolis and TDL, as well as between Lyrasis' DuraCloud and Chronopolis. The project team recognizes that the complexity of this system creates a barrier to the formation of a DDP network for sensitive data, but there are others. Beyond the complexity of our current technological infrastructure and these legal entanglements, we face other barriers to creating a DDP for sensitive data. Clearly, we need to align and enhance our agreements, and we especially need to make sure we have all of the necessary business associate agreements, or BAAs, in place. Additionally, with sensitive data there's always a liability concern, and the assignment of liability can be perplexing with so many stakeholders. Just to name a few: there are depositors, donors, and preservation staff like the repository managers, archivists, developers, and service providers. There are the people in the data. There are communities represented by the data. There are also the original creators — the doctor, the researcher, etc. And then finally there are the administrators, like the academic deans and presidents for overall support and approval, the university CIO, the university librarian; I could go on. As archivists and librarians, we are data custodians who decide how and when the owner's decisions are applied based on classification and regulations, but without knowing who owns the data, it can be hard to make good decisions about digital preservation. And how do we properly identify who the data owner is? That determination will guide policy about who can manipulate, control, and ultimately destroy sensitive content.
Our HIPAA consultants on the grant conveyed to us that in the US, ownership of data is largely, though not entirely, driven by commerce; they also noted that in the EU, alternately, it is generally driven by privacy. So for these issues and the other reasons I mentioned when discussing our use cases, many institutions are simply not ready for a DDP option for their sensitive data. But once they are ready, we still have a high bar to meet for appropriate service governance. Our assembled guests at our in-person meeting at the end of last year had clear recommendations that governance decisions be collaborative, but grounded in centralized decision making. They asked for diverse representation of institutional types, practitioners, and storage node partners, and that our vision, roles, and responsibilities be clear. They asked that any network be data- and standards-driven, as well as responsive to legal fluctuation and jurisdictional differences, and that the data owner maintain control of their data. Our partners at the meeting, many of whom were well acquainted with the failure of the Digital Preservation Network, or DPN, emphasized transparent financial reporting, a good succession plan, and open communication at all levels. This slide illustrates the very basic configuration and requirements of a HIPAA-compliant DDP network. Currently, SDSC — the San Diego Supercomputer Center — and TACC are the only compliant nodes we've identified, so we're already at a bit of a disadvantage in terms of being able to form a DDP network with a minimum of three distributed nodes. We are working to overlay our existing DDP network infrastructure with the needs of this much more simplified one, and to identify the requirements to accommodate a subset of sensitive data without disrupting the regular service or undermining compliance with HIPAA as the gold standard.
We are also open to modeling a "mostly HIPAA compliant" service, without the expensive audit, because we've found that so many of our partners aren't actually accepting real HIPAA or FERPA data into their collections. For that use case, we'll redefine HIPAA's covered entity and business associate for our own purposes. The team will continue our work through September of this year, intermittently gathering use cases where they pop up — and that's a call to action, so please reach out to us if any of you have one you'd like to contribute. We're working on our service model, as I mentioned, which has included looking at gaps between the technical infrastructure of our current systems and the requirements of a sensitive data network. Additionally, this summer we'll be drafting templates for the agreements that need to be in place, as well as gathering as much information as possible about costs; as Sibyl mentioned, we're already on that. Our final report will recommend next steps and outline the work that still needs to be done to achieve a successful DDP option for sensitive data. As I mentioned before, we've only identified two potential nodes, so we'd like some feedback if you'd like to participate. If a willing node doesn't already exist, we would need significant seed funding toward the HIPAA compliance activities required to establish that third node. So that concludes today's briefing. I want to thank all of our partners, as you saw on the slide Kristi presented, who have participated with us so far and continue to do so. Please do reach out to us if you'd like to be involved in the final months of our work. You can find all of our project documentation so far on the project wiki, which contains use cases, DDP documentation and agreements, as well as all of our notes from our in-person meeting late last year. Finally, I want to thank CNI for your work transitioning the conference to all-virtual. We know it must have been a big challenge.
So now Sibyl, Kristi, and I welcome your questions.

Thank you so much. I know I learned a lot from this presentation. I've been thinking mostly of health sciences data, but things like sensitive data from political movements are also certainly the kinds of materials that archives in universities may be collecting now. We have one question so far. It's from Ray, and I apologize if I mispronounce the last name. The question is: does the scope of this project include sensitive and private faculty research data — for example, faculty DoD research data, or other sensitive national grant-funded faculty research, such as that funded by NSF or NIH? Are there thoughts toward the size of data sets for this type of repository, and toward the frequency of upload, download, and retrieval? Are these envisioned as dark archives, or can they be frequently accessed by specified or approved users? Any recommendations toward these areas? Will some of this be included in your report? You can take a look at the Q&A. Who would like to respond?

I can pop in. There were a series of questions there, and I can speak to the access part — that's probably the easiest to answer. The way the Chronopolis network is set up now, it is a dark archive. Across the three sites, only system administrators are actually allowed to access data, so there is very limited access to the data. And I'm actually trying to recall the other questions now.

So it's a dark archive, but would a researcher who's been given permission be able to go in and out, retrieving different parts of a data set or performing different queries?

They could definitely make requests for data retrieval, yes. Queries, not so much; it's not really a working area. When the data is finalized, it is preserved in the different nodes. Ray, I hope that answered your questions.
If not, maybe type up something additional in the Q&A.

There were additional questions within that particular question. This is Kristi. The other questions were around, have we thought about scale, the size of data sets? Yes. And would we take in faculty research data? I think the answer to whether we would take in research data is potentially yes. Do you or Courtney have thoughts about scale?

I'll take both, actually. On the first question, about the research data, I agree: potentially yes, but with the caveat of what Sibyl mentioned about this being a dark archive. And now I'm forgetting the second one.

That's large data sets.

Right. Yeah, go ahead, Courtney.

Yeah, in terms of data size, you know, I think we've all worked with these large computing centers for so long that we know the answer to that question is: well, we can do whatever you need if you have enough money. That's exactly what we're facing. Would you agree, Sibyl?

Yeah, and I think different people have different ideas of what constitutes a large data set. In my mind, one terabyte is not large, though I know in other people's minds it is, whereas one petabyte I would consider on the large side.

We see that too at Yale — the vast difference in what people are actually talking about when they say large or big data.

Yeah. And the other thing I wanted to get back to, because Department of Defense data was mentioned: one thing we haven't looked at in this grant is classified data, and I think that would probably require additional considerations.

Yeah, I think that's right. I don't know if that could fall under the bar we've set at the HIPAA and FERPA level.

Yeah. To follow up, David Millman has a question asking Courtney to address a little bit more what HIPAA without audit means. Is that not then HIPAA compliant?
Well, what I meant by that is that HIPAA compliance audits are extremely expensive, and we have the advantage that UCSD — through the San Diego Supercomputer Center — and the Texas Advanced Computing Center have already undergone those expensive, giant audits. So as two of our nodes, that's awfully convenient, and it cuts costs. If we were to add a third node, there would be a much higher cost to get it to HIPAA compliance. All that to say: what we have discovered in the use cases we have collected is that there is very little actual HIPAA and FERPA content being held in the custody of the libraries and archives we have spoken with. So what I meant was something HIPAA-like without the actual expensive audit — as close as we can get to it, renaming some of the essential elements of the HIPAA-compliant nodes, but without going through the rigamarole and the cost of the audit itself.

Thank you. We have time for one last question, which is from Tim McGeary. He asks, how are you addressing privacy of data? For example, data may have been collected from users who may not have considered data storage in such a manner. Privacy policies have changed substantially over the years, so perhaps he's thinking of historical data that archives might want to acquire from a researcher, or something along those lines. Would anyone like to address that?

I can start, and Courtney and Sibyl can chime in. One way to answer that is by reiterating what we're finding: there is a lot of that kind of data — historical data that might contain private and sensitive information — in archives, for instance, and our potential users suspect that's the case but have not done the assessment, or a full assessment, to know whether it is. So I think it is definitely a concern, and it's part of that challenge of readiness that we're encountering kind of left and right with this.
Thank you. We've reached our time, and so I want to thank Courtney, Kristi, and Sibyl for giving us a really excellent presentation and some really clear answers to the questions. And I thank all of the participants for taking the time out of the crazy days we're experiencing right now to come together and discuss some really important themes for our CNI community. So thank you very much, and take care, everyone. Stay healthy.