Thank you for coming to this session. I wanted to talk to you about a project we've undertaken at UVA, both to reduce our risk to user privacy and to enable librarians to do research on the data we've coalesced. Today I'm going to talk about risk to user privacy, which we've heard a lot about at C&I today and yesterday; look at trends in circulation data retention across peer institutions; go more in depth on the project we took on to lower our risk and enable library research; look at the aggregated data we collected from disparate library systems and enterprise systems and how that data is being used; and wind up with challenges we overcame, challenges we still face, and possible opportunities.

The first thing I wanted to talk about is academic freedom at risk, and I want to come at this from a couple of directions, starting with a couple of examples at the University of Virginia. In April 2010, the Virginia Attorney General, Cuccinelli, filed a subpoena seeking documents of a climate scientist who was at UVA in the late 1990s and early 2000s and then went on to work at Penn State. Essentially, the Attorney General launched an investigation alleging that Michael Mann and some of his colleagues had fabricated research data to support climate change. Many organizations and individuals came out in support of academic freedom in this case and really saw it as a turning point. Over the next couple of months, it went to a Virginia circuit court, which dismissed the investigation for lack of evidence. But soon thereafter, in 2011, the American Tradition Institute, a free-market think tank that also denied climate change, followed the dismissal with a FOIA request.
It would take four years to reach the Supreme Court of Virginia, which unanimously affirmed the earlier circuit court decision and ruled in the university's favor. But I can tell you, our general counsel spent a lot of time fighting to keep this information private.

Also, in 2015 we saw a lot of higher ed institutions come under attack from various nation states and other types of security breaches. In June of that year, a federal agency notified UVA's CIO that two nation-state actors might in fact have access to university systems. As you can imagine, this is something no CIO wants to hear. That summer turned into a massive undertaking, not only to find out the extent of the compromise but also to plan a remediation without alerting those actors. This involved some 176 staff at the university, one government agency, and two commercial companies, who set out to determine the extent of the compromise and devise a plan for remediation. By August 2015, many of us who were unaware of what was going on found out at five o'clock on a Friday afternoon that the university was going dark. For the next three days, the university undertook a massive effort to remediate the compromised systems, clear up the problem, and bring the systems back online, and I'm happy to say they did a fabulous job getting us back up. In the end, UVA learned that two faculty in particular who had visited China had been under surveillance for more than a year.

In 2020, libraries published a lot of research on the growing problem of online surveillance and censorship. And by April 2021, RELX, the parent company of LexisNexis, had signed a multimillion-dollar contract with the U.S. Immigration and Customs Enforcement agency, ICE.
The ICE contract revealed that LexisNexis databases would offer an "oceanic" computerized view of a person's existence and would provide the agency with the data it needs to locate people with little or no oversight. Both RELX and Thomson Reuters have built global surveillance systems that include online tracking; they have a massive, growing aggregation of user data, and evidence of a sale of services based on that tracking.

I'll wrap up this section by talking about the Patriot Act. The act was established in October 2001, and over more than a decade it continued to be reauthorized even with evidence of abuse. In March 2020, an effort to extend it once again did not pass the House of Representatives, and now the law is dead. You might ask why we care about this in this context. It's because of Section 215, which meant that surveillance did not need to be reported. So even though we have all of these instances where we know people have been under surveillance, it's what we don't know that's also deeply concerning, because there's a lot of user data out there that's been obtained through this surveillance.

So I'll bring it back to UVA and talk a little bit about the risk we were under of exposing individuals' private information. We had circulation data dating from 2009 forward, and this circulation data included the university ID, the computing ID, and information about what people were researching, and those IDs could be used to get other things, like first and last names. You might think: why didn't you clean that up? Well, every time we tried, we got a lot of pushback from individuals within the library who found this data vital to the research they were doing to improve services, find lost items, and have evidence to go back to. The university also went to a distributed budget model, where all of the schools began being taxed for administrative services.
So we had schools coming to us saying, "We don't use the library, so we don't want to pay that library tax," and we needed this data to show them that yes, you do use the library, and here's how. The data is also valuable for informing collection building through usage trends, validating surveys, and in some cases staffing. Our primary risk was not FOIA, because this is considered business data; the risk was court orders and, obviously, security breaches. The security breach that happened at the university actually started in a department, on an unpatched server that had access to the central Active Directory, and from there those actors were able to hack into, I think, twenty-some systems at the university.

One of the first things we wanted to do was think about what kind of retention schedule we wanted, so we went out and talked to a lot of our peer institutions. What we found is that people largely are not keeping circulation data at all, or are keeping it under 30 days, or have an opt-in arrangement where the user has to choose. We found a few institutions keeping data for six months and a couple for a year, but only a few were keeping data indefinitely like we were, and all of us were striving to get to zero or at least to some lesser extent. There was a lot of debate at our library: some people wanted to keep data for a year, some didn't want to keep any at all. We finally got to a point where we could compromise at 90 days, and in a year or two we're going to revisit that and see if we can go to zero.

Our project goals were threefold. We wanted to reduce the risk to user privacy. We wanted to augment the data we had from these logs with data from other systems: Aeon, which we now use for circulation of special collections; ArchivesSpace, which describes manuscripts; and an in-house app that handles requests from faculty and students for digitization projects.
So we wanted to coalesce this data from the different systems, but we also needed data from enterprise systems. Any time we needed anything to do with the roles of people using the library, or information about schools (and until recently there was almost never any information about departments in central systems), staff would have to jump through a lot of hoops to get that data. So we wanted to explore what we could get from enterprise systems to enable better research, and we wanted to provide an application that would allow librarians to search, discover, and mine the data we eventually archived.

One of the first things we needed to do was collaborate. I was at some earlier sessions where people talked about the importance of working with an institutional review board. In our case, not only did senior leadership have to weigh in, but we also worked with Institutional Research and Analytics, who had access to a data warehouse that could pinpoint the time and date of a circulation record and tell us more about the individual (I'll go into the data later). The institutional review board for the social and behavioral sciences helped us look at the ethics of the data we were proposing to collect and archive. And we worked with central Information Technology Services and the chief information security officer. We don't have a privacy officer, so he wears both hats. He looked at our overall system design, the data we were collecting, every aspect of the project, and really served as a consultant to make sure we were securing everything the right way.

All of this collaborative work ended up shaping our approach. We revised our retention and archiving schedule, and we worked on a de-identification that still retained uniqueness: knowing a record belonged to a single individual without having that person's personal information.
We did that by using a salted hash: we took two pieces of information about a person that don't change very much, mixed in other special characters, and created a salted hash. This enabled the library to understand whether it was 50 people checking out 50 books or one person checking out 50 books, and of course it had an impact on our security and access.

The data we ended up collecting came partly from the logs and partly, as I said, from these other systems: time and date; school and department; the unique key, which is the salted hash I was talking about; the family group and family name, which identify what role the person serves at the university; roles; borrower profile, of course; faculty type; whether they're degree-seeking or not, and the degree level. The home library field isn't populated with meaningful information right now, but we think it will be over time, as people identify which library they prefer to work with. Then the station library, item library, and location; whether an item is part of the reserves (I'll skip some item data that's not meaningful); primary subject and subject fields; classification, item type, format, publication date, and language.

With this application, we wanted to make available all the different possible values, some of which are incorrect data, but at least librarians would see them and be able to filter them out. We also wanted them to be able to search by any combination of selected attributes and data, so they could really narrow down the data set rather than having to sift through a large data set for something specific they're looking for. And we wanted them to be able to browse through pages of results to verify they were actually getting the data they needed. Often these terms and the values that turn up mean something different from what they thought.
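To make the salted-hash de-identification concrete, here is a minimal sketch of the mechanics. This is an illustration only, not the actual UVA implementation: the attribute values and the salt are assumptions, and in practice the salt would be a secret random value kept out of the archive.

```python
import hashlib

# Assumed secret salt; in a real system this would be random and protected.
SALT = "example-secret-salt"

def de_identify(attr_a: str, attr_b: str) -> str:
    """Combine two rarely-changing attributes of a person with a salt and
    hash them into a stable key: the same person always yields the same key,
    but the key cannot be traced back to the person without the inputs."""
    material = f"{SALT}|{attr_a.strip().lower()}|{attr_b.strip().lower()}"
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

# The same person always maps to the same key, so one person checking out
# 50 books is distinguishable from 50 people each checking out one book.
key1 = de_identify("mst3k", "1998-08-21")
key2 = de_identify("mst3k", "1998-08-21")
key3 = de_identify("abc2d", "2001-03-05")
```

How resistant this is to re-identification depends on how well the salt is protected and how guessable the input attributes are; the sketch only shows why the key preserves uniqueness without carrying personal information.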
So a lot of times they'd have to go back and modify their search to get the right data set, but once they have the data set they need, they have the ability to export it for further processing. After looking at all the different ways people needed to conduct research, we decided that handling all of that in the application would make it too complicated, and there were other technologies out there that could do the job better.

Some of the technical decisions we made: automate everything so that we could meet the retention schedule; base it in cloud infrastructure, because most of our applications are already in the cloud and the ones that aren't are quickly moving in that direction; use a Solr index for speed and flexibility; and export in CSV format for post-processing. One important factor was the ability to archive all of the source files we were getting from different systems, because once things came back to us de-identified, we had no way to go back and recreate that data. So we really needed to archive those source files for disaster recovery.

I wish I had our library assessment librarian here. She wouldn't get on a plane, and I appreciate that, but she would do a much better job of describing how she went about forming a sandbox, so I'll do my best. One of the things Annette Stahlnacher appreciated was access to the living data and the aggregation across systems, because she was one of the people who had to go through these hoops every time the administration wanted to prove one thing or another. She would have to work with all these different departments and coalesce data from other systems, and of course it never came in a timely fashion. So she really appreciated being able to aggregate all this data from across systems and have it at her fingertips.
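The "search by any combination of selected attributes" behavior against the Solr index might be sketched as a small query builder like the one below. The field names here are hypothetical, not the project's actual schema; this is only meant to show how arbitrary attribute selections can be combined into one narrowing query.

```python
def build_solr_query(selected: dict) -> str:
    """OR together the values chosen within each attribute, and AND the
    attributes together, so any combination of filters narrows the set.
    An empty selection matches everything (Solr's *:* query)."""
    clauses = []
    for field, values in selected.items():
        joined = " OR ".join(f'"{v}"' for v in values)
        clauses.append(f"{field}:({joined})")
    return " AND ".join(clauses) if clauses else "*:*"

q = build_solr_query({
    "borrower_profile": ["Undergraduate", "Graduate"],
    "item_format": ["Book"],
})
# q == 'borrower_profile:("Undergraduate" OR "Graduate") AND item_format:("Book")'
```

The resulting string could be sent as the `q` parameter of a Solr select request; paging through results for verification would then just be a matter of adjusting Solr's standard `start` and `rows` parameters.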
She used a couple of examples. She conducts longitudinal analysis on a regular basis, and she's also the person who works with others to put together our surveys, so she's used to looking at the results that come back and asking: is the information you're getting actually valid? She sees this data as useful for validating surveys, or for doing what she calls myth-busting.

One of her first use cases was our School of Commerce, the McIntire School, one of those schools that would come back and say, "We don't use the library, so we don't want to pay the tax." What we found by looking at the data is that that's a myth: circulation was consistent. She pulled data from 2009 all the way through 2020 to ask, has it changed over time? Are they right that their people don't use the library? In fact, they are consistently using it, mostly for books and DVDs. She also learned that the most heavily used library is Clemons, which now has more collections but traditionally was thought of as the undergraduate library, and that the most common borrower profile was undergraduates.

The next question she asked was which classifications, at a high level, faculty are using. By department, she wanted to know which faculty used the library most often, and not surprisingly it's English and history; they already had a pretty good idea that was the case. They also wanted to know which library is most used by faculty, and again weren't surprised to learn it was the main library, though we also know that faculty love our LEO service, the library delivery service for faculty. She also wanted to verify that the Ps, as expected, had the highest circulation, and they did: English literature was the highest subclassification within the Ps. So she didn't really turn up anything earth-shattering there.
Next: Beth Blanton-Kent, or Beth Blanton, I think she now goes by, on informing library collections. She's the person who for over a decade had been begging for this data, and for data from central systems, because she is constantly being pressured to build collections in the direction of a changing landscape: changing curricula, a greater degree of interdisciplinary studies, diversity, equity and inclusion. She's constantly asked, how are you serving these populations? So she wanted to know who really uses what, and who to talk to. What do the usage patterns say versus the anecdotal feedback she was getting? Like Annette, she wanted to do immediate and long-term analysis of collection use and have data at her fingertips to make better management and maintenance decisions. Another thing she really wanted to know: can we correlate user success to collection use? This is one of the big places where we come to butting heads over collecting data. We want to know if the library actually contributes to user success, but at the same time we have a commitment to protect users' personal information. We still don't have a way to correlate this, but we're trying to find other avenues, because as we all know, when you give up information to another entity it's very hard to get it back.

One of the cases she looked at: does interdisciplinary or shifting research impact collection usage? She looked at archaeology, and what she expected to find was usage from the art department, which houses archaeology studies, or from anthropology. What she learned is that history's usage left those other departments in the dust, that student usage was double that of faculty, which was a real surprise, and that most materials circulated between 1 and 2 p.m.
She still doesn't know why, but now she's wildly interested in talking to people. Another case she looked at, because they had just overhauled their Arabic collection plans: who uses the Arabic collection? Surprisingly, she learned that the School of Continuing and Professional Studies also heavily used the collection, and that overall use was evenly split among faculty, graduate students, and undergraduates. She had never thought about contacting that school about the usage of collections, and certainly not the Arabic collection, but now she's better informed about which departments to talk to about the uses of different collections.

I also had Cecilia Parks on the project, representing teaching and learning, and more broadly all of the people working with faculty, departments, and students. For her area, Cecilia was interested in understanding use patterns among departments, subject areas, and all the different user groups they work with; again, verifying anecdotal evidence. She keeps hearing that undergraduates say they want fun fiction, and she was curious what that means: what are they checking out? What do they think is fun? She's also interested in identifying collections and areas for outreach and development, and in identifying areas for instruction, which is primarily where she's focused, maybe building tutorials for finding books in a specific location or call number range. She was the one who originally came to me and said, "Somebody told me you're the person who has talked about coalescing this data in the past. I'm working with diversity, equity and inclusion efforts, and we're really interested in knowing more about that, and how it compares to how we're building collections and how people are using the library."
Unfortunately, even though that's what started the project discussion, it was something we found we were not going to be able to do, primarily because the enterprise systems did not even support the concept of gender and race until the last few years, and even when they did, people were sometimes only given a binary option. So the data in the systems is not only incomplete but factually incorrect, and I wish that were the only place we found incorrect information, but I guarantee you it was not.

We also see potential uses on the front lines, primarily around location usage trends: for instance, a department primarily using Clemons, or the item location versus the circulation location. If you find out the majority of a collection is being requested in a particular building, maybe you want to move those items over there to make things more efficient. Also circulation trends by date and time: when are your peaks? How many circulations are you doing at this service point versus that one? Should you have more students (we employ a lot of students in our front-line services) working in one building versus another?

Some of the challenges we encountered: I've already talked about the DEI data. There have also been system and data migrations across the university. I've been there a long time, and we've changed our HR systems at least three times; the library just migrated data from ILS to ILS to ILS and didn't clean things up, for good reasons. So there's a lot of stuff out there that we've been trying to clean programmatically, though in some cases people have had to go in and fix things by hand. Central systems aren't the only ones with interesting data in fields where we wouldn't expect it, but we're appreciative of the things we do have.
There's another problem: we have to weed the collections, and that left a gap. If we were looking at a circulation log, we only had part of the data, so we would want to look up current information about that item in the ILS: the subject fields, the subfields, all the other location information about that particular item. Obviously, if you've weeded the collection and no longer have that record in the ILS, you can't do that, but we still kept as much data as we could about the item. And because librarians are able to see the data and filter on it, these gaps are much more visible to them now than in the past.

The other big gap, the largest, is the increase in digital content, and especially with the pandemic and online learning, we're only going further in that direction. So far the University of Virginia Library has been a place that did not want to make people authenticate for everything they access. I've been asked many times in the past why we can't say how much people use our spaces, and the answer is: if you want to put gates at the door and have people authenticate and swipe a card every time they come in, we can tell you that. The best we've been able to do is put in sensors, which don't do a great job of telling you how people use the space, but it's the best we can do without authentication. We also don't require authentication for access to digital content, and that's a real problem, because it's going to make the data less and less complete as time goes on. So we're trying to figure out what to do. I was interested in the talk yesterday that Sarah and Lisa gave about new technologies and ways we might leverage them, and I'm wondering if we could partner with publishers to get some of this data back through the authentication delegation they're going to do with
institutions that have SAML in place. UVA has had SAML in place since the 90s, so we should be an institution able to leverage something out of that, and able to tailor how much information we turn over to publishers.

I wanted to close with opportunities. I'm curious how many of you have undertaken similar projects. Do you collect this kind of data? Do you have these same kinds of research questions? I know that now that librarians have seen this data, they want more; they're thinking of other data they'd like to collect and add to this collection, to do other kinds of evidence-based research and make more data-driven decisions about improving services. So I'm wondering how many of you have done this. And what could we tell about the change in research libraries if we were to aggregate this data? Is there a possibility that, if we de-identified data at different institutions and aggregated it, it could better inform us about trends: changes in disciplines, changes in curricula, changes in student populations and the way they use our collections? So I'm curious to hear, and I'm happy to answer any questions.

I want to quickly go through some acknowledgements. This was our team, and I'm forever thankful for people working so hard on what took us about a year to get off the ground. I also read a lot of research, particularly on the attack on academic freedom, and these were some of my sources. Of course I'm glad to answer any questions about the project, but I'm also wondering if some of you would be willing to share similar projects that you have. Thank you. Can I get a show of hands: how many of you have done some kind of data aggregation like this? A fair number. Jamie?

Hi Robin, Jamie Wittenberg, University of Colorado Boulder Libraries. I'm curious if you could talk a little bit about your model at
UVA for data governance and stewardship around these kinds of data, at the library level and at the campus or university level, and how those models either enabled or obstructed your ability to do this work.

So, just in the past four years, the CIO led an effort to establish a data governance group, and I think it includes people from IRA and the IRB and the CIO's office to talk about these matters. But projects like mine didn't really go through that group; they went instead to the different components within it, to check security and check ethics. I think as it matures over time, maybe we'll get a chief privacy officer in addition to the security officer, and things will travel in that direction more often. So I'd say it's pretty new and not really matured. Thank you. Lisa?

Hi Robin, Lisa Hinchliffe, University of Illinois at Urbana-Champaign. I think this is really interesting, and so thoughtful, the way you've tried to grapple with some really challenging issues here. If you're being challenged with "hey, our students don't use the library," it's not enough to just say "a lot of students use the library"; you need to be able to respond to that challenge. What I'm actually interested to ask you about is the intersection with the IRB. It feels like you're saying at least some of this falls outside the IRB's purview, but does any fall inside, and under what circumstances?

It was really interesting. I expected them to weigh in more heavily. They were glad to help, but what they initially told me was, "We usually just work on grant projects, but we're happy to talk to you." So we went through it, I explained the whole project to them, and they did actually help us think about the ethics. They said that, given the way we de-identified things, they felt that even in the event of a security breach there wouldn't be enough
data in there to actually damage the reputation of anyone; it wasn't powerful enough to actually impact or damage, and that was their main consideration. But I expected it to be a lot more involved than that.

So, what I'm trying to understand: do you actually have an IRB approval for this, or did they just consult on the design? They actually approved. Okay, so are you then able to publish anything you want out of this data, or would a publication trigger the need for an additional look at the particular research project? No; we're able to present on it, and they didn't restrict any publication, because it's so general, you can't identify the person, and we don't have any PII in it.

Okay, I appreciate hearing about this. For the last three years I've been running a project in Illinois called CARLI Counts, which has been teaching librarians across our state how to gather data, analyze it, and the like. All I can say is that this past year, where we had people doing projects in teams, we would have projects with, say, four different institutions, which meant four IRBs weighing in. The inconsistency in how libraries are treated by IRBs is a major challenge to the slide four slides ago about comparing, contrasting, and aggregating. It's a whole talk, and I won't give it in a question comment here, but this is really interesting. We have two projects where essentially one person had to withdraw so that the project could go forward: even though the University of Illinois at Urbana-Champaign, with a medical school, had its IRB approve the project, a particular community college would not approve the same project. So this is one of the things I want to raise in the context of what I think we need to do, which is ask ourselves how we are doing this across the field, and try to surface the differential
treatment of library data within the IRB rubric. Some IRBs also wouldn't review these projects because they said it's management data, not research data. That's fine, unless you're on a project with somebody at the University of Illinois, where our IRB says it's research data. So now we have two rulings, one saying it's outside the IRB and one saying it's inside. So I'm sharing some of those experiences, but I really wanted to understand a little more of what your IRB did, which is a whole new version.

I appreciate that, because I think our IRB took more of the first stance: "We only work on grant-funded projects for researchers; why are you coming to us?" So yes, it's interesting.

Hi, I'm Yasmeen Shorish, I'm at James Madison University. I have a question, in particular about this idea of "how do we know if our students are using the library?" Did you all have any discussions about how to answer those kinds of questions without collecting data?

Well, I would say it wasn't totally impossible to find out about students as a large group, because of the borrower profiles; we get information from student systems, although I will say staff override that, because they have the ability to in the ILS. For instance, I'm treated as a faculty member, and I've been staff forever, so I have a different borrower profile. But that doesn't give you any information about the department or the school, and when you asked central systems about that, you would get partial answers and people really reluctant to give you the data. Even now, talking about the enterprise systems, if we weren't working with the people who had the data warehouse, it would have been very difficult, and for a long time the university didn't even keep information by department, which has also come up.

I just have a follow-up, I guess. I don't have high confidence in our abilities as libraries, down the road, to keep the data in our circulation systems and in enterprise systems,
to have agency over it, because of the vendor tentacles. So I'm just wondering: we go to data because it's there and we think it can give us answers expediently and efficiently, but are there other kinds of methodologies we should be investigating, to move away from an over-reliance on data and this data-driven ethos? There may be other effective ways to get at things that are harder to count, about relationship building and growth over time. Do you do surveys?

Yes, we've always relied on surveys to give us a lot of that information as well, but of course you only get a certain amount of response. A friend of mine was sharing, these last couple of days, a massive effort to survey students; they don't yet have the results, but they got a really good response. Our response rate varies over the years, and so you wonder how valid it is to extrapolate, first from the population that was actually presented with the survey, and then from the number of people who answered it, which can sometimes be pretty low.

Hi Robin, I think it's the Virginia contingent here: Irene Harold, VCU. I'm wondering, have you approached VIVA to look for partners? I meet regularly with John Unsworth and Tyler Walters of Virginia Tech, and your dean, so maybe there is a possibility of us all collaborating.

I love that idea. You can tell him Irene sent you. Okay, if there's nothing else, I think you can have an extra 15 minutes. Thank you.