Data Mining Partnership for Library Operations, remarks by Scott Britton and John Renaud at the 161st ARL membership meeting, convened by William Walker.

Now it's my pleasure to introduce two of my colleagues from the University of Miami, John Renaud and Scott Britton. Together they have responsibility for collection strategies, education and outreach, scholarly communications, and customer service and satisfaction. We all work for a very data-driven provost. I always say you can go in and tell Tom LeBlanc the nicest, most compelling story, but he won't give you a dime as a result of that. He needs to see the data. And so we're very conscious, as we make recommendations, make decisions, and evaluate programs, that we can back it up. One of our problems has been that we've never been able to bring data from across campus together, from the registrar's office, from student affairs, from the library's data banks, and put it in a warehouse and start to mine it. But we've made progress during the last year thanks to the work of Scott and John and a computer scientist who's an expert in data mining, Mitsunori Ogihara. We've broken down faculty senate barriers, where they suddenly decided that they, rather than the library, owned and governed access to the data on who was in the library. Managing that data, we've come to blows with the IRB, and we're here to tell about it. So I'll turn things over to them. There are some really unique results that they've uncovered.

Thank you, Bill. And we want to thank everyone here also for the opportunity to share our initial findings from the library data mining project. We're first going to discuss the partnership that we formed on campus to make this happen and then move on to our initial findings. Academic libraries are increasingly required to offer evidence of their effectiveness in supporting institutional goals. User surveys and assessments of instruction sessions are often used to provide evidence of library effectiveness.
But while these tools should continue to be used, and they're a significant source of data, there are limitations to their effectiveness in showing the unbiased impact of library services. Specifically, they rely on users to provide accurate reporting regarding their use of the library. They also entail the creation of the assessment tool, time for users to complete it, and often incentives for the users to complete it. An alternative to surveys and feedback forms is to use existing data and data mining techniques to reveal patterns of use and correlations between library activities and user achievement. While data analysis is a common library practice, the data is usually contained within a single set of manageable size. Data mining, on the other hand, entails the use of very large data sets or multiple data sets that are then analyzed to reveal patterns. It can be a very powerful way to assess library operations and library impact. For many libraries, deep analysis of their data may be difficult or impossible due to user privacy concerns, lack of system interoperability, or simply staff unfamiliarity with advanced data analysis. These were the issues that the University of Miami faced when trying to use the large amounts of data that we routinely collect. So John and I selected data mining as our research project in order to address these problems and put our data to work.

Forming partnerships with researchers on campus is an efficient way to benefit immediately from the experience of others, both in the analysis of data and in the management of user privacy concerns. We did just that by entering a partnership with the University of Miami's Center for Computational Science. The main purpose of the CCS is to assist researchers in the analysis of very large data sets. So they work primarily with faculty at the Miller School of Medicine, the Rosenstiel School of Marine and Atmospheric Science, and the School of Engineering.
The CCS provides secure processing and expert analysis of large and/or sensitive research data. Bill mentioned Mitsunori Ogihara, and he was really the idea behind this data mining project. We learned that he was conducting research in the digital humanities, and we approached him because he was doing the same sort of analysis that we wanted to do in the library. Examples of the questions we had included: who is not using the physical library, how does book use differ among subject areas, and does library use correlate with student success? So we sought the expertise of Dr. Ogihara with the goal of discovering patterns of use and demonstrating library effectiveness. In addition to his other responsibilities, he is now formally associated with the library as the associate dean for digital library innovation.

Can you hear me in the cheap seats? Okay. Okay. The research team consisted of Dr. Ogihara, staff from the Center for Computational Science, Scott, and me. Many of our research questions required sensitive data, such as grade point average and logon activity for our electronic resources, and therefore we were required to get approval from our institutional review board. The research protocol was written and IRB exemption was granted, and fairly quickly, actually. Much of the needed data was not in the possession of the libraries, so requests for data were sent to the registrar, human resources, and the Office of Student Activities. Having IRB exemption and a research partner that routinely works with private medical data made the requests for external data much easier than expected in most cases. After a brief explanation of how the research project would proceed, most parties were gladly willing to grant their data. Ironically, and Bill mentioned this, the only data that was difficult to get was the library turnstile data.
Although we are owners of the data and routinely access it to determine the number of people who entered the library facility, we needed to ask Central IT in order to get data for large periods of time. This triggered a protocol that eventually required approval by the chair of the faculty senate, and for those of you who have visited Miami, you know that our faculty senate is indeed very active.

To receive IRB approval, we needed to guarantee that all data would be purged of personally identifiable information. The data was further cleaned by restructuring it so that the various data sets could be integrated into a single data set and stored in a data warehouse. The process to do so involves submitting identifiable data to the vault, which is a secure server at the Center for Computational Science. This is similar to the process used to clean private medical records. What the CCS did was to replace the university ID number, which is equivalent to a social security number in that it maps to a single person, with a unique project ID number. The project ID number was attached to demographic information, for example, academic department, major, campus role, et cetera. But no personally identifiable information, such as a person's name or the university ID, was retained. The unique project ID allowed the various clean data sets to be linked without compromising privacy. Analysis of the various data sets did not allow lookup by name, university ID, address, et cetera, and results were not provided that might identify someone based on a unique combination of demographic information. For example, a junior male majoring in physics: that type of information would have been cleaned out before we could see it. So through analysis, patterns emerged from the data as expected. And preliminary findings are discussed here.
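The ID-replacement step described above can be illustrated with a short Python sketch. It is only an assumption about how such a step might look, not the CCS's actual procedure: the function names, field names, sample rows, and the random-token scheme are all hypothetical.

```python
import secrets


def build_project_ids(university_ids):
    """Assign each distinct university ID a random, unique project ID."""
    mapping = {}
    used = set()
    for uid in university_ids:
        if uid in mapping:
            continue
        project_id = "P" + secrets.token_hex(8)
        while project_id in used:  # regenerate on an (unlikely) collision
            project_id = "P" + secrets.token_hex(8)
        used.add(project_id)
        mapping[uid] = project_id
    return mapping


def pseudonymize(records, mapping, identifying=("university_id", "name")):
    """Return copies of records with identifying fields replaced by project_id."""
    cleaned = []
    for rec in records:
        out = {k: v for k, v in rec.items() if k not in identifying}
        out["project_id"] = mapping[rec["university_id"]]
        cleaned.append(out)
    return cleaned


# Hypothetical sample rows; real inputs would come from the registrar,
# HR, the turnstile logs, and so on.
registrar = [
    {"university_id": "C100", "name": "A. Student", "major": "Physics"},
    {"university_id": "C200", "name": "B. Student", "major": "English"},
]
turnstile = [
    {"university_id": "C100", "swipe_date": "2012-09-04"},
    {"university_id": "C100", "swipe_date": "2012-09-05"},
]

# One shared mapping is what lets the cleaned data sets be linked later.
mapping = build_project_ids(r["university_id"] for r in registrar + turnstile)
clean_registrar = pseudonymize(registrar, mapping)
clean_turnstile = pseudonymize(turnstile, mapping)
```

In a setup like this, the mapping itself would have to stay inside the secure environment, since anyone holding it could reverse the replacement.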
And we want to call them preliminary because, with each result, and any of you who've worked with large sets of data know this, each time you get an answer, you end up with more questions. So it's this process of iterative query and cyclical review of data that makes data mining such a powerful assessment tool. As we review our results, I want to be clear that the circulation and catalog data includes all University of Miami libraries except the law and medical libraries. And the turnstile data is only for the central library.

So analysis of library checkout activity by user type is shown... actually, that is strange. What we see here is a little bit different from what you have here. Okay, so sorry about that. In the middle column, you can see library checkout activity by user type. It revealed pretty much the pattern we expected: the percentage of checkouts roughly corresponds to the size of the population, with undergraduates checking out the greatest number of items, followed by graduate students and then faculty. But then we asked for the percentage of each user type that checked out material. And we found that only about 20% of undergraduates checked out library materials and only about 16% of faculty. What surprised us was the low percentage of faculty checking out library materials, but this is only for two semesters. People could be away, or maybe they just didn't need the library that semester. So we need to look at this data over a longer period of time.

Collecting this data was a bit of a challenge, because our ILS doesn't retain demographic information with its circulation activity. So in order to collect longitudinal transaction data that could be associated with demographic information, we took weekly snapshots of everything checked out at that moment in time and sent them to the CCS. And they cleaned it up.
But that also involved cleaning up the data across multiple weeks, because a book is often checked out for more than one week at a time. So they had to remove any items that were repeated week after week. So while it is not a complete circulation history, it resulted in a very large set and a very good representation of circulation activity that we could then link to demographic data.

We expected to find that print use varied by department, and the assumption was that the humanities would use the print collections more than the sciences. And that assumption held true. Here we see the top five departments as a percentage of total circulation. As expected, English takes the top spot by a large margin. But we were not expecting to see law among the top five. The law school has its own library and a separate library system, so we would expect the law library to satisfy its users' needs for print materials. And from observation, we know that members of the English department frequently use the library. They're a very vocal, active community in the library. However, the English department is also very large. So in order to understand the relative extent to which each department uses the library, we asked for the percentage of people within each department who checked out materials. This required using registrar and HR data to determine the total number of people associated with the department or major. And after doing that, all of the top five departments support our observation that researchers in the humanities and social sciences are the heaviest users of our print collections. We are curious about why music is so high. We removed CDs and DVDs from the calculation, so we're looking mostly at book circulation activity. And still about 36% of the people in the music school are checking out materials, but it could be that scores and other shorter-use materials account for that. Next, we'll look at the turnstile data for the central library.
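The weekly-snapshot cleanup and the per-department percentage described above could be sketched in Python roughly as follows. This is a hypothetical illustration, not the CCS's code: it assumes each snapshot row carries a project ID, an item ID, and a checkout date, so that the same loan can be recognized when it reappears in a later week, and all names and sample values are invented.

```python
from collections import Counter


def distinct_loans(weekly_snapshots):
    """Merge weekly snapshots, keeping each (person, item, checkout date) once."""
    seen = set()
    loans = []
    for snapshot in weekly_snapshots:
        for loan in snapshot:
            if loan not in seen:  # same loan seen again in a later week
                seen.add(loan)
                loans.append(loan)
    return loans


def percent_borrowing(loans, department_of):
    """Percentage of each department's people with at least one loan.

    department_of maps every project ID in the population (borrower or
    not) to a department, as registrar and HR data would.
    """
    population = Counter(department_of.values())
    borrowers = {pid for pid, _item, _date in loans}
    active = Counter(department_of[pid] for pid in borrowers if pid in department_of)
    return {dept: 100.0 * active[dept] / n for dept, n in population.items()}


# Hypothetical data: one book stays checked out across both weeks.
week1 = [("P1", "book-17", "2012-09-03"), ("P2", "book-40", "2012-09-05")]
week2 = [("P1", "book-17", "2012-09-03"), ("P1", "book-52", "2012-09-12")]
department_of = {"P1": "English", "P2": "English", "P3": "English", "P4": "Physics"}

loans = distinct_loans([week1, week2])
shares = percent_borrowing(loans, department_of)
```

Dividing by the full department population rather than by total checkouts is what keeps a large department from looking like a heavy user simply because of its size.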
And this is relatively easy to collect because we do have a turnstile at the front door, everyone has to swipe their card, and we get a log of all of those card swipes. We do know that undergraduates use the library in the greatest numbers, but they're also the largest group. So our question again was not so much the total number from each group, but the percentage of each group using the library. And by asking that question, we were able to determine that graduate students actually use the library in a greater percentage. We've been hearing that graduate students needed more library services, and we're providing those, and now we have evidence that that is actually true. These numbers do seem low to us, considering that in any given month during the year about 100,000 people come in through the library. But closer analysis revealed that there are quite a few very frequent users of the library, so although they increase the total turnstile count, they don't increase the number of unique users.

So here we have turnstile results showing the top five departments as a percentage of potential users within that department. What's very surprising to us is that the Schools of Music, Architecture, and Law are among the top five departments despite having their own libraries. All three of those schools have their own physical libraries, and yet they're still coming in the greatest numbers to the Central Library. English obviously continues to be a department that uses the library heavily. Because we saw that the law school uses both the physical library and checkout of print materials from the Central Library system in great numbers, we've since actually had the law librarians come to the Central Library for a tour and Central Library staff go to the law library, just so that we have a better understanding of the spaces and services that are provided in both places.
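The distinction drawn above between total turnstile swipes and unique users can be made concrete with a small Python sketch. The data layout, function name, and sample values are assumptions for illustration only.

```python
from collections import Counter


def turnstile_summary(swipes, role_of):
    """Summarize card swipes per campus role.

    swipes is a sequence of project IDs, one entry per card swipe.
    role_of maps every project ID in the population to a campus role.
    """
    swipes = list(swipes)
    population = Counter(role_of.values())
    total = Counter(role_of[pid] for pid in swipes if pid in role_of)
    unique = Counter(role_of[pid] for pid in set(swipes) if pid in role_of)
    return {
        role: {
            "swipes": total[role],
            "unique_visitors": unique[role],
            "percent_of_group": 100.0 * unique[role] / n,
        }
        for role, n in population.items()
    }


# Hypothetical population: four undergraduates, two graduate students.
role_of = {
    "P1": "undergraduate", "P2": "undergraduate",
    "P3": "undergraduate", "P4": "undergraduate",
    "P5": "graduate", "P6": "graduate",
}
# One very frequent undergraduate visitor inflates the raw swipe count.
swipes = ["P1"] * 5 + ["P2", "P5", "P6", "P5"]

summary = turnstile_summary(swipes, role_of)
```

In this invented sample the undergraduates generate more swipes, but a larger share of the graduate students actually visit, which is the shape of the finding reported in the talk.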
We also conducted an analysis of the usage of different types of physical library materials. Books continue to be the most commonly checked out material type. From there, the data varies depending on the type of user. For our undergraduates, we found that after books, the next highest circulation in terms of materials was DVDs. For our graduate students, it is musical scores that took the second spot, and for faculty members, CDs are in this category. We also investigated whether or not there are differences in checkouts among the four undergraduate classes. For books, the pattern is what we expected. It is clear that the upper-level students are checking out more books than the first- and second-year students. This pattern doesn't hold for DVDs, which are more evenly distributed among the four classes. We think that's probably due to the fact that the DVD collection, although created to support research, teaching, and learning, also includes many popular Netflix titles, and it's recreational viewing that is accounting for a large portion of this usage.

Libraries often look for tools to help determine which materials to keep on site, which materials to move to remote storage, and what materials might be serviced from consortial arrangements. We looked at the average length of time from the date of cataloging to the date of last checkout for our print materials. Circulation data in our system is only available from 1991 to the present, so items cataloged in fiscal year 1992 were examined. The total number of days between the catalog date and the date of last checkout was calculated and then grouped by call number. Here the data shows how the use of books varies greatly depending on the discipline. As expected, philosophy, history, and music and art display longer periods of use than books in the sciences.
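The catalog-date-to-last-checkout calculation described above might look something like the following Python sketch. It assumes, purely for illustration, that grouping "by call number" means grouping by the first letter of the Library of Congress call number and that items never checked out are left out; the function name and sample items are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def mean_days_of_use_by_class(items):
    """Average days from catalog date to last checkout, per LC class letter."""
    days_by_class = defaultdict(list)
    for call_number, cataloged, last_checkout in items:
        if last_checkout is None:
            continue  # never circulated; excluded in this sketch
        lc_class = call_number.strip()[0].upper()
        days_by_class[lc_class].append((last_checkout - cataloged).days)
    return {lc: mean(days) for lc, days in days_by_class.items()}


# Hypothetical items: (call number, catalog date, date of last checkout).
items = [
    ("B105 .A35", date(1991, 9, 1), date(1991, 9, 11)),
    ("B720 .C6", date(1991, 9, 1), date(1991, 10, 1)),
    ("QA76 .K5", date(1992, 1, 10), date(1992, 1, 15)),
    ("ML410 .B1", date(1992, 3, 2), None),
]

averages = mean_days_of_use_by_class(items)
```

How never-circulated items are treated is a real design choice: leaving them out, as here, measures how long used books stay in use, while counting them as zero would measure something closer to overall demand.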
The predominance of C call numbers, which in library-speak, for those few of you from Dewey institutions, are referred to as auxiliary sciences of history, results from users working in the history of civilization, archaeology, and biography. The large number of M, or music, materials that circulated points to heavy use of library resources by the music school. We assume that as we are able to examine this data beyond 20 years, an even greater difference in book use will be revealed between the humanities and the sciences. We'll use this data when we speak to faculty about selecting materials for removal to remote storage.

And you go and press print, and the little box pops up and says page one of 36,000, and you get a little alarmed. And that's the use of electronic resources. Tracking visitors to the library's online resources was indeed more difficult. Licensed databases are authenticated by IP address. Users on campus may access them without authenticating at all, and so no user information is available to examine. Furthermore, IP addresses at our institution are not assigned systematically, so we were unable to group users' IP addresses into anything that might provide demographic information, such as department or campus role. However, off-campus use of licensed databases does require personal authentication. For those accessing databases through the main library portal, authentication is provided by a proxy server, so that it appears to database vendors that users are working from campus IP addresses. The proxy server keeps an audit log, a very long audit log, showing the university ID and the session number assigned to the user for the period during which they are logged in. Data from the audit log was uploaded to the data warehouse and cleaned. Linking demographic information to the audit log data provided a partial picture of off-campus use.
The law and medical school users use different authentication systems, but they do have the option of authenticating through the main library, so some use by these users is captured. We realized that although there were some insurmountable impediments to gaining data about which users were accessing specific resources, we could obtain information about the different categories of users who access the library's portals. Here our preliminary findings are both interesting and counterintuitive. We found that a large number of medical and law users, who have their own libraries with independent authentication systems, are choosing to access materials through the central library's portals. We know that among all users, STM materials, that is, science, technology, and medicine materials, predominate as the most accessed content. We did find that the expected STM databases and journals appeared among our most accessed resources by remote users. However, with this group of users, ProQuest products were the most accessed. This is one example where our findings prompted further investigation. We use Summon as the default search service across all library resources, and Summon is, of course, a ProQuest product. When we contacted ProQuest, we learned that ProQuest databases use a level of indexing for inclusion in Summon search results that hasn't been implemented by all their data providers. This may be causing ProQuest products to rise to the top of the list in Summon results, which could explain the pattern we see in the user authentication data.

So our final research question was related to library use and student achievement. We did a correlation analysis comparing turnstile activity and checkout activity with undergraduate GPA. And you can see there's one yellow highlighted area showing a very weak positive correlation between turnstile activity and GPA.
But we also know that not all majors require physical use of the library or even books to succeed, and so we needed to do a further refinement of the analysis. Breaking down undergraduate results by class year shows that seniors do have a much stronger correlation between GPA and library use. The other three classes have a very weak positive correlation or no correlation at all. And breaking down undergraduate results by their various schools, and within the College of Arts and Sciences by the areas of humanities, social sciences, and STM, revealed some findings that were puzzling to us. Architecture, music, and law are the only schools that showed consistently positive correlations between GPA and library use. We also ran, for peace of mind, a correlation between use of the university gym and GPA. And thankfully, there was no correlation with the gym, so the library is at least doing better than the gym at student academic achievement.

In order to maximize our understanding and generate action-worthy conclusions, we will perform several cycles of analysis. We know from our work to this point that each question we answer generates yet more questions. Our next level of analysis will reveal checkout patterns for materials, focusing on academic department, to determine whether researchers in the areas of study that we assume are heavy library users, such as literature and history, are indeed generating more use than shown in our initial findings. We also need to understand why users whose primary affiliation is with schools that have their own libraries, such as music, architecture, and law, are making use of the central library to the extent that they do. Our circulation data on books in different subject areas over time is interesting, but more information is needed before we can comfortably act upon it.
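The correlation analysis discussed above, run overall and then broken down by class year or school, could be sketched in Python as follows. The talk does not say which coefficient was used, so Pearson's r is an assumption here, and the function names and sample rows are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient, or None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # no variation in one variable, so r is undefined
    return cov / sqrt(var_x * var_y)


def correlation_by_group(rows):
    """rows holds (group, library_visits, gpa); returns r for each group."""
    grouped = defaultdict(lambda: ([], []))
    for group, visits, gpa in rows:
        grouped[group][0].append(visits)
        grouped[group][1].append(gpa)
    return {g: pearson_r(xs, ys) for g, (xs, ys) in grouped.items()}


# Invented rows, not the study's data.
rows = [
    ("senior", 2, 2.8), ("senior", 10, 3.2), ("senior", 18, 3.6),
    ("first-year", 5, 3.5), ("first-year", 5, 2.5),
]

by_class = correlation_by_group(rows)
```

A coefficient like this only describes how two measures move together within a group; it says nothing about whether library use causes higher grades.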
More detailed subject breakdowns will help identify those materials that should remain on campus and those which might be serviced effectively from either remote or consortial storage. We know that specific LC classes might skew the results of the broad subject classifications. We also know that we need to add more years to the analysis, moving beyond just looking at those materials that were purchased in fiscal year 1992. The correlation between library use and student achievement is always difficult to determine. The need to access physical or electronic library materials varies by discipline. Looking at more specific data regarding academic majors will be necessary to reveal an accurate correlation between library use and student achievement. We also know that non-library factors influence student achievement, so these must be analyzed as well. Finally, and perhaps most importantly, we would like to work with the Software Engineering Division at the Center for Computational Science so that the principal investigators can query the data without intermediation from the CCS. Speaking of which, we do want to thank the other members of our research team. They did a lot of the heavy lifting that made these preliminary findings possible, and we look forward to continuing to work with them. Thank you so much.

Amazingly, we're still on time. We have time for a couple of questions or comments for Scott and John. Ann?

Hi. Terrific, terrific report. A couple of things to ask you about. First of all, have you done any correlation with other similar studies of print usage that have been done? I'm thinking, as you were talking, about the Cornell print usage study, the Walker Report, and what it showed in terms of the use of books over the last 20 years. We need to start thinking about whether these sorts of studies are replicated or are distinctive at each of these institutions.
And there are some obvious false drops that'll come out. So the heaviest use at Cornell of non-English-language material per school and college was in art, architecture, and planning, and of course what they were doing was using the graphic content rather than the text-based content. But what I really liked about your report is the use of the central library by different schools and colleges with their own libraries. And sort of a light bulb that goes off for me is that, as I talk to those respective deans about the importance of supporting libraries beyond their own library, I would have evidence of the heavy use that they make of the library beyond that. So I think that's terrific.

Bill likes that too. There are going to be quite a few conversations. We obviously did a literature review looking for the use of data mining techniques among libraries. We looked at some reports, but really we needed to get started, and actually to meet some deadlines on analyzing our own research. We do want to look at how we compare to other libraries, but I think at this point we need to dig deeper. We need to answer those questions. Why is music coming in? We knew that they were heavy users of the music library, but why are they coming into our library? And certainly when we look at funding, the law school, I don't think, supplies any funding to the central library. Bill can speak to that. We need to understand why, when they have a big library with lots of good study rooms and nice spaces, they are coming into our space.

Wendy.

Wendy Lougee from Minnesota. We actually have been doing data mining as well, on 13 different variables, only where we can capture somebody's identity, and I thought you might be interested in a little compare and contrast, and maybe we can make some conclusions about weather. We found that if you controlled for the demographics, college environment, and academic variables, using the library one time was associated with a 0.23 increase in first-year students' GPA.
Now, you know, it's a correlation; we can't say causation or those sorts of things. When we also looked at retention, controlling for those same variables, a student who used the library at least once was 1.54 times more likely to reenroll, so we're getting some good retention data. We did, however, also look at using the gym, and there was a much better correlation than yours.

You've got to come in from the cold. Hey, it's Miami. Yes, please.

Hi, Gregg Geary, University of Hawaii, and as a music major I suppose it's inevitable I should have become a librarian. This is fascinating information. I just had one question that may be related to when you pointed out the difficulty in dealing with databases, because we have noticed similar results in our statistics dealing with the circulation of non-print materials, specifically CDs, DVDs, that sort of thing. But I was wondering if you had a way to monitor any of the streaming products, because that's where we've been moving. Not only third-party streaming such as Films on Demand, which we've found very useful for documentary material and curriculum support, but we also do our own capturing of material. We have archival videos that we capture and have copyright ownership for; we then stream those, and we do count those statistics. So I didn't know if you counted any of that. Are the populations accessing this non-print material in different ways than just going to the library and checking out a DVD or a CD?
One of the challenges that we found was with a lot of our electronic resources. I know a lot of us are in the same boat: 70% of our budgets go in that direction, and we know it's the majority of our usage, but there were these bars to finding out what user groups were actually doing what with them. And that's the issue that we confronted when we started to try to look at the streaming products and, again, individual accesses of any of our resources. We know who the users are from our proxy audit data, but we were unable to find a way to match up those user sessions to know who was doing what. So I have the raw data that can say, for example, for Films on Demand, how many downloads there were over a period of time, but I can't match that up to know that it was a medical student or that it was a music major. We tried very hard, and it was heartbreaking when the doctoral student came back to us and said that, because the URLs are dynamic and because the user session logs work in the way that they do, she stood on her head and she just couldn't find a way to do it. And that was a disappointment.

Thank you. Appreciate it.

A final question or comment? Join me in thanking the fellows for their work and presentation.

Thank you for listening. Music was provided by Josh Woodward. For more talks from this meeting please visit www.arl.org