All right, I think I'm going to get started. This is a session about protecting researcher privacy in the surveillance era. I know there are a lot of competing sessions, so I boiled this down to one slide. You can review it and decide whether this is what you're interested in; you won't hurt my feelings if you jump up. So we're going to talk about patron data, about the identity politics that come along with patron data, and then some things that I think need to be mitigated at this point, with an example. The thesis here is that we know too much about researchers at this point, and the legal and network privacy protections have failed, so we need to make some changes in how we manage patron data. Before I even begin, I want to thank some people. Professor Talithia Williams at Harvey Mudd College and her Math 158 students have helped me all along the way in analyzing data, and this has been a great collaboration. They love taking our data, cleaning it up, and doing regression analysis on it as an exercise. We've been doing this every semester for about five years, and I love getting the results and having the interactions with students; it's been a very positive experience. Pat Flannery is a network systems administrator at Pomona College who has allowed me to basically march all over the wireless controller. And Rebecca Lubisar, associate dean, encouraged me to come do this in the first place. I am Sam Combe, director of strategic initiatives and information technology at the Claremont Colleges. If you're not familiar with the Claremont Colleges, briefly: it's a consortium of five undergraduate and two graduate colleges with one library, which means it's really important for the library to understand how the campuses are interacting with our resources and services. That's part of what I do: study patron activity along with our assessment librarian.
For really legitimate reasons of student-faculty centrism, we're very user-focused, patron-focused, visitor-focused at the library. It drives everything we do, so I want to understand what patrons are doing. We need to convey the value of the library to the colleges. We're funded by formula from each of them, so if one college doesn't feel like it's getting value out of the library, that's a big problem for us. On the other hand, if we're doing really awesome stuff, we want to be able to convey that too. For future planning and justifying the services we have, I need to be able to understand what's going on with patron activity. And if we have events, which we do, say a discourse event with a faculty speaker, we might want to be able to analyze that particular event. Now, the tools that we've all had available over the years kind of don't cut it. Probably all libraries have electromagnetic gates that make sure books don't go out through them. They also count people: any electromagnetic disturbance through that gate gets counted, going and coming, which means that data is pretty useless because you're just counting relative activity. I can go in and out through different gates; I've got three entrances in our building. I go out to lunch, I come back. How many times do I count? I don't know. So it gives you this kind of relative busyness number that isn't helpful. It doesn't tell me what campus people were from. Headcounts? We have a four-story building, hundreds of thousands of square feet. Headcounts take a long time, and they also don't tell me very much. I can see there's a person there, and they're asleep. I can see there's a person there, and they're studying or using their computer. I don't know what campus they're from, what year they are, what program they're in. I don't know much about them at all. And going back a ways, we also had cards.
So this was an interesting way to keep tabs on patrons. I remember from way back that the cards would have people's names on them, and they'd be scattered all throughout the building. That seems like kind of a privacy problem; glad we stopped doing that. But it was difficult to collect, and I think it looks kind of naive in this age. This book, which is an engineering book, probably seemed perfectly safe to check out back in the 60s and 70s, and today it might cause somebody real problems. Chapter 2 is about how to make explosives, and chapter 9 is how to ignite them with electrical devices. You can imagine some of our patrons might not want to be in possession of this book today. We moved on to integrated library systems, and the cards went away, so that was kind of a good thing. But now we've collected all the patron data into the ILS, where it can be subpoenaed by the FBI as of 2001 under the USA PATRIOT Act. And so libraries started scrubbing the ILS circulation data, which was good for patrons, bad for us: kind of hard to know how busy patrons were or how many times things were circulating. But it got us this cool button that says we're radical militant librarians, so kind of a net positive. In the integrated library system, this is the sort of data you can see. I know you can't read that; the stuff that I'm keying in on is in red. There's just a little bit of stuff in there, not a lot. That's the takeaway. That's personally identifiable information. In our ILS, we don't keep any demographics, just contact information. There's a necessary network ID, and I want to talk about this network ID, because it's the key to all of this data. There's another bit of PII: everybody's got a barcode or an ID number of some kind that's probably related to Blackboard or an LMS or something like that. And we keep a phone number, but that's opt-in. We don't typically call patrons.
We just want to be able to tell people, hey, your book is here, or please return that book, and they can provide those values. But we have to have that network ID. Now, that might be called different things in different places. If you're sitting here, you probably read my little disclaimer up front: this will be relevant to folks who have central wireless and LDAP and things like that. In Shibboleth it's called a GUID; in LDAP it's the cn, the common name. Anyway, it will be your network ID. It looks like an email address, but it follows you everywhere. Now we have a new tool. Well, it's not that new, right? Everybody's had wireless networks for a long time. What's new, since maybe 2008, is that they started centralizing. When I first arrived in Claremont, we didn't have a central wireless network. It was a nightmare: we had 40 independent access points that were little computers, and at any given moment one of them was down, and you had to go find it and reboot it. That was not awesome. So instead, we installed a system where you have access points that are much dumber. They're basically FM radios connected back to a very smart controller, and all the traffic just tunnels back to that controller and then out through the network to the internet or wherever it's going. And the controller has a log. The other thing, nice or bad depending on your perspective, is that the access points have locations, and if you name them after their locations, you can get a richness of data that we couldn't get before. So here's an example of a bit of the wireless controller log. It has a log file just like a regular old server. For my purposes, I'm taking a small extract of the stuff I'm interested in. I'm actually thinking of just not collecting some of this anymore; you'll see why in a minute. But it has that same network ID in it.
So everyone who is in range of a wireless network wants to connect to it. That's just a truth. If the wireless in here went down, everybody would leave the building. It's like oxygen. For folks who've connected to the wireless anywhere on the Claremont Colleges campus, and I'm sure this is true on your campus too, when their device comes in range of a wireless network, it tries to connect. Whether or not you ever take it out of your pocket, it has pinged an access point, and when it pings that access point, it gets logged. This is what a log entry would look like. You've got your network ID and the MAC address, the media access control address, which is a unique ID for your device. Then, depending on the configuration of the controller, there may be other things: when it connected, when it disconnected, how much traffic went over it, the vendor of the device. It's an Android device made by Samsung, with the exact radio. There's just a lot of information here that you could triangulate somebody with, even if you didn't have that network ID. And in our case, we've named the access point for where it physically is. It's a CUC, Claremont University Consortium, device; we've got two buildings, Honnold and Mudd, and this one is in the Mudd building, on the third floor, in the Keck 2 classroom. So I know that device is within 20 or 30 feet of that access point, and since it's an Android, it's in somebody's pocket or on their person. So I know where that person is. If you do nothing else, the wireless controller can take the place of those gate counts. This is just a dashboard view from the wireless controller. It's telling me that there are 111 clients, which are these electronic devices, on the network, 141 more on the encrypted network, and 24 guest clients. I've gone out and ground-truthed this and figured out that everybody's carrying 1.8 devices. Some associates at Northern Arizona University went through the same exercise, and they got 1.8 devices too.
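That device-to-person conversion is simple arithmetic. Here's a toy sketch, assuming the 1.8 devices-per-person ratio holds; the function name and the dashboard numbers are illustrative:

```python
# Rough headcount estimate from wireless client counts, assuming the
# ground-truthed ratio of ~1.8 devices carried per person.
DEVICES_PER_PERSON = 1.8

def estimate_people(client_counts):
    """client_counts: device counts per network segment (e.g. open,
    encrypted, guest). Returns an estimated number of people."""
    total_devices = sum(client_counts)
    return round(total_devices / DEVICES_PER_PERSON)

# The dashboard numbers above: 111 + 141 + 24 clients.
print(estimate_people([111, 141, 24]))  # -> 153
```

So those 276 devices on the dashboard work out to roughly 150 people in the building, which is the kind of number a gate count could never give you reliably.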
So that seems to be a pretty solid number. We don't actually need to count devices, though. We can take it a step further and look at the network IDs. If you take the network ID, it's got my campus on it, so split the campus off. My network ID is samk@cuc; CUC is my campus. Now I can tell, in the top row there, all the campuses: CGU, Claremont McKenna, Harvey Mudd, KGI, Pitzer, Pomona, Scripps. I know how many people from each of those campuses showed up in May on which floor. That's much better than a gate count. And at this level, I'm kind of comfortable with that knowledge. That's an amount of data I think is okay; I can't tie it back to any individual person. So this is really useful. Rebecca Lubisar, the associate dean I mentioned earlier, had asked me: at what time period during the day are all seven campuses represented in the building? Is there an hour, is there a minute, when that happens? I said, I don't know, but I know who does. So we asked the wireless controller, and it turns out it's from nine in the morning until nine at night. All seven campuses. That was novel information, because we had campuses telling us, oh no, our students never use the building. We've since learned that, yeah, actually, everybody uses the building, and for many hours at a time. Okay, so we talked about the ILS and the wireless network. This is nothing new, but there are also proxy logs, ILLiad logs; you probably have a number of different electronic resources, services, and servers, and they log things. If you're in a Shibboleth or LDAP or CAS environment, that's gonna be that same global user ID, more than likely. Again, mine's samk@cuc. You probably have some session information too. Now, EZproxy is sort of endlessly configurable; ours is pretty simple. But even so, I can tell you what journals this guy looked at at any time, what articles he retrieved, what search terms he used.
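That campus split is a one-liner if you have the network IDs. Here's a sketch with made-up IDs, assuming the user@campus format described above; the point is that the user half is discarded immediately, so only the campus aggregate survives:

```python
from collections import Counter

def campus_counts(network_ids):
    """Count visitors per campus from user@campus network IDs,
    throwing away the user portion so no individual is retained."""
    return Counter(nid.split("@", 1)[1].lower() for nid in network_ids)

# Hypothetical log extract; only the campus suffixes matter here.
ids = ["samk@cuc", "alice@hmc", "bob@hmc", "carol@pomona"]
print(campus_counts(ids))
```

Group that same counter by access point name and by hour and you get exactly the floor-by-floor, campus-by-campus picture described above, without keeping anybody's identity.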
And with Google Scholar, if you actually proxy Google Scholar, every time they press a key on the keyboard in the search box, that gets logged. So if you look up (give me a book title) it's gonna be F, F-O, F-O-R, and it just goes on and on in the log. Somebody tell me why that is so we can tell Google to cut it out. And then, really interesting, and this would be a great clinic project of some kind: every time somebody is reading a PDF in the browser, every time they turn the page, we can see page turns, which is a fascinating kind of engagement measure, but also getting a little creepy. I know some folks had noticed that the Adobe DRM platform was actually sending that data back to Adobe initially. And I think they've cut that out, but that's something we need to keep an eye on. In plain text, right? Who said that? Thank you, yeah. Now I see you. I need to know who's in the audience. I know I have a computer science professor. Any lawyers? No? Seriously, I have a law question coming up. Okay, great, awesome. I told you what I am: I'm a librarian. I am not a lawyer. I am not an IT specialist. I'm a jack of all trades, so I'm probably doing something that's horribly wrong. Please just tell me; I don't wanna send out wrong information at any point. Okay, so we kinda covered where we came from. We had information scarcity, and now we seem to have a glut. So I wanna talk about the implications of that. This report on authentication and authorization from last year, which I've seen a lot of people citing: if you haven't read it, please read it, memorize it, and then we'll all talk about it later. I just wanna bring out three things from that report. From the survey of academic institutions and others was this notion that content suppliers seem to have no plans to support Shibboleth. So I'm not gonna be able to go to JSTOR and log in using my Claremont credential. I'm totally fine with that.
I came to that conclusion a long time ago, so we did the next thing, which is to proxy everything. Whether you're on campus or off campus, we proxy it. And that's actually fine. In a Shibboleth environment, it's a good user experience, because our users have logged into Box.net, they've logged into Sakai, they've logged into any number of different things. If they have logged into something on the Claremont network, they're not gonna have to log in again to use our resources; it just sends them right through. The browser knows it's authenticated and they're good to go. You kind of have the opposite problem: you have to kill the browser to kill the session. If somebody knows how to fix that problem, I'd be glad to hear it. So it's a seamless user experience, and that's important for us. We wanna make sure our users aren't getting hit with login boxes every five minutes. The other thing I like about it is that the content providers only ever see the IP address of that EZproxy server or service. They don't see my laptop or my phone or my tablet IP address. They're not gathering PII on me; they just know they're talking to this proxy server. So with that in mind, we're proxying everything. Here are the implications. The personal, the physical, and the online (and by personal I mean PII: the name, patron type, that kind of stuff, whatever demographics you've got) can all be joined together by the network ID. That was kind of a revelation to me. When you think about it, there's some really great stuff we can find out here, but then it gets a little too close to home. Research activity by user, that's not new; we could do that before with EZproxy. But building use by visitor, that was new to me. And then you combine all three of them, and I can tell you research activity by location by visitor, or research activity by visitor type. So: on campus, in the library, Web of Science use by a faculty member from HMC whose name rhymes with... that's totally doable.
It's kind of trivial, I know, because I did it. This is not a hard data problem. What does that look like? Here's some actual data from May of last year. These are Harvey Mudd faculty members, counts of unique faculty members. The first column there, those are folks who were sitting outside our Digital Humanities Studio on the third floor of Honnold: 37 distinctly different HMC faculty members were there. I don't actually need to know that. I kind of made this graph just for y'all, to show you what's possible, but it gets worse. Again, these are access point names, and we've named them after where they are, so I can tell with pretty good granularity where folks are. And for the folks in the front who might have seen it, there's a signal strength measurement that's also in the wireless controller log. With that, I can drill down even more. If you've got a weak connection to that access point, you're farther away from it; if you've got a strong one, you're closer. If there are three people with the same signal strength, they're sitting together. So let's drill in on that a little bit. More is always better. These are the same HMC faculty, and we're just looking at the individual faculty members now. I cut off their names, and I'm drilling into repeat visits by those faculty members. Faculty member A visited 35 times, faculty member B 28 times, to that location, which is undisclosed. Let's take another step. This is for all of fall semester of 2016, and we're planning a renovation on the fourth floor of the library, so we kind of wanted to know who's up there and what they're doing. And I just took the joke a little too far. This is e-resource use on the fourth floor by graduate students of the Claremont Graduate University. They use EBSCOhost a lot; that's not a big surprise, we have a lot of stuff licensed under EBSCOhost. They're going through OCLC, which is either catalog use or ILLiad.
JSTOR use, Ebrary use, ProQuest use. Okay, that's good to know, and this has a huge long tail; I cut it off at 20. But let's look at one of them. This is one graduate student on the fourth floor. Fall semester, this is primarily what they looked at: ProQuest, Wiley, JSTOR, Springer, Taylor & Francis Online. They didn't use ScienceDirect so much, so that's interesting to us; that's not a cheap resource. Well, what were they looking at? They were looking at this. Here's a search, there's a resource, there's another search. Again, this is not difficult to do. I just picked three of their URLs out of the EZproxy log. If you're a relational database person, you could throw this into Microsoft Access. It's at the outer bounds of what Access is comfortable with, but for a semester it's a couple gigs of data. So, did that creep anybody out? Okay, good. I thought maybe I was just getting worked up over nothing, but I tend to think we can't ignore these data. To me, this looks like kind of a new problem. It's the same problem as 2001, the same reasons people stopped keeping the circulation logs, but (so, to the lawyer: is this legally obvious? I don't know) we have some new twists. I noticed in December there was a revision to the Federal Rules of Criminal Procedure, Rule 41, which has to do with search and seizure. The way I read it now, any judge in any location, so a judge here in California, could write a warrant for any computer or computers anywhere, to retrieve whatever. And nobody needs to be notified of this except law enforcement. The only provision that I can see in there is that somebody thinks that computer's up to no good: it's encrypting its traffic, oh my God, it's on VPN, or using BitTorrent, or something like that. But that's not actually language that's in the law, so it's pretty open to interpretation, I think.
So that being the case, if these log data are just lying around, then we have a problem, because they won't be coming to our door. They're just gonna snake it out over the network, and we won't know. They can still get a warrant from the FISA Court and come and retrieve it, that's true too, but it kinda creeps me out a little bit that they don't need that. One of the things that we've seen is that the networks are compromised. And that's not me saying it. Cisco in 2014 (I have their quote in the presentation) said, essentially: we've given up on that, we're all about mitigation now. Your network is compromised; what are you gonna do about it? I'm glad you asked that question. So for that reason, but also because we're seeing now that either the NSA or the FBI seem to have the ability to get into your network with or without notification. I love our IT guys to death, but I doubt that they are ready to stop that. So I'm worried about both of those angles. [Inaudible audience exchange.] Here's the thing: maybe that's a high threat and maybe it's a low risk, but are we willing to take that risk? We're seeing now that our patrons are not implicitly safe either. So, that example of the book I gave before: we just don't know anymore what the status of patrons is and what the perceived threat level of what they're researching might be. We've got a lot of patrons researching terrorism right now. What does that look like to a law enforcement agency? We've got a lot of patrons who are from countries that are apparently not welcome here anymore. So my responsibility to safeguard those patrons is on my mind, and whatever our actual risk level is of the network being compromised, I just wanna really examine what data we've got lying around and what actions I'm taking to safeguard it. We'll talk about that in a second.
So I think I have an obligation to protect those data, and for our part, I'm looking at three different actions. If we've got PII that we think we need, anonymize it; hence the questions about hashing. If we've got these data, let's really examine what we need and aggregate it up to the highest possible level. Rather than keeping it at that graduate student's network ID, let's keep it at "graduate student." I just need to know they're a graduate student; that's all I need to know. Then it's a lot harder to trace it back down to an individual person. And that includes all that triangulation data. If I know they're a graduate student, but I also know all that business about it being an Android device of this vintage with this operating system on it, that's plenty to tie it back to an individual person. So triangulation data has to be included in that. And then expunge what you don't need. When I first started this project, I started poking around looking for log files, and I found years' worth of log files just lying around, because why not? Storage is cheap; there was no real impetus to go delete it. We had EZproxy logs going back forever, wireless logs going back forever. So we're making it not just a policy but a procedure to go back and delete those logs. I have to admit I'm still in the process of that. It's a big job, anonymizing and then deleting the raw data. So this is what that looks like to me. Proxying everything, I think, is one step to take. If you're not doing that and you can, why not? You can test this out with a limited set of resources, and even patrons, to see if the user experience is good enough. The way that works is I ask a question like, what are the ethics of religion-based executive orders? The proxy server re-asks that question to Political Research Quarterly. Political Research Quarterly is having this conversation with the proxy server, not me.
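The aggregate-and-expunge step can be sketched in a few lines. This is a toy version with hypothetical field names, not our actual procedure: each raw record is reduced to patron type, location, and timestamp, and everything that could triangulate a person (MAC, vendor, OS, the network ID itself) is deliberately dropped before the raw log is deleted:

```python
def scrub(record, patron_type_of):
    """Reduce a raw wireless-log record to the minimum worth keeping.
    patron_type_of maps network IDs to categories like 'grad student'.
    MAC, vendor, and device details are discarded, not copied over."""
    return {
        "patron_type": patron_type_of.get(record["network_id"], "unknown"),
        "ap_location": record["ap_location"],
        "timestamp": record["timestamp"],
    }

# A hypothetical raw record; only three of these fields survive.
raw = {
    "network_id": "someone@cgu",
    "mac": "a4:77:33:01:02:03",      # dropped
    "vendor": "Samsung",             # dropped
    "ap_location": "honnold-4th-floor",
    "timestamp": "2016-10-03T14:05",
}
print(scrub(raw, {"someone@cgu": "grad student"}))
```

The design choice is that the safe fields are whitelisted into a new record rather than the dangerous ones being blacklisted out of the old one, so a new log field added by the vendor can never leak through by default.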
It sends back the PDF; the proxy server sends me back the PDF. At no time did my name leave the network. Yes? [Audience question about browser fingerprinting techniques and trackers that vendors insert in their HTML, meaning the publisher is actually getting a lot of information, way more than enough to identify an individual.] Thank you for that question. It's above my pay grade, but that's the kind of question I was hoping somebody would ask. I get the gist of what you're asking, and I think that's one of the mitigations we're gonna have to really look at, maybe not at my level but in IT shops, to find solutions to that. Or at least advise patrons that that's happening. And that's something I didn't put in there but I think is actually a part of this: being transparent with the patrons that some level of this analysis is happening, whatever that level is at your institution. Let them know. So, anonymization. I'm using a very simple, maybe too simple, anonymization process. I just wanna show you an example and the definition of it. I think it's important to use a standard hash; don't make up your own. There are lots of good libraries to use. NIST has a secure hash standard (they have many), and I'm using SHA-256. What happens with that is, when it sees samk@cuc, it turns it into this gibberish right here, which you can't reverse. Nobody could take that gibberish and turn it back into samk@cuc. Now, samk@cuc is not exactly a giant secret, so that's not sufficient. Somebody could get all the network IDs for Claremont trivially, no black helicopters, and then just try a bunch of different hashes until they found the ones in our logs. So we do salt everything. There's an example of what that looks like. It just means: add something to your initial value, then hash that, and don't tell anybody what the salt was. Done properly there's a lot more to it, but that's an example. And again, that's all the code it takes.
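For reference, here is roughly all the code a salted SHA-256 takes in Python. The salt value here is made up for illustration; in practice you would generate one secret salt, reuse it so the pseudonyms stay consistent across logs, and store it somewhere separate from the logs:

```python
import hashlib

SALT = b"keep-this-value-secret"  # illustrative; generate your own and guard it

def pseudonym(network_id):
    """Salted SHA-256 of a network ID. The same input always yields
    the same digest, so repeat visits can still be counted, but the
    digest can't be reversed or brute-forced without the salt."""
    return hashlib.sha256(SALT + network_id.encode("utf-8")).hexdigest()

print(pseudonym("samk@cuc"))  # 64 hex characters of "gibberish"
print(pseudonym("samk@cuc") == pseudonym("samk@cuc"))  # -> True
```

Done properly, as the talk says, there's more to it: a keyed construction like HMAC-SHA-256 (Python's `hmac` module) is the more standard way to mix a secret into a hash, but the shape of the code is the same.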
It's not rocket surgery. The personal data: that's another big debate, I guess, but I don't think it's really debatable. Anything that looks like PII to me, I regard as PII, and there's no real cost to anonymizing or aggregating one versus two versus three different values. So certainly the network ID, because that's absolutely tieable back to me in a bunch of different systems. The IP address of the machine: right now at Claremont, everybody's got a public IP address, and I don't think that's uncommon at academic institutions. I don't always get the same one, but I looked over time, and pretty much I always get the same one, so I just go ahead and hash that too. The MAC address: now that we're carrying machines in our pockets, I regard that as PII. It's pretty personal; it's in my pocket, and it's always the same for that mobile phone, laptop, whatever it is that you've got. So I hash that as well. And then really pay attention to the demographics. With the ILS, our access services director was just like, I don't want any of that information, I don't need it, let's not have it. We didn't even have this conversation. She was just like, it's just more work for me, I don't want it. I just need to be able to contact the patron every so often. So that was easy. But there's a lot of other demographic data, or triangulation data (HTTP headers, thank you very much for adding to my headache), that we need to be aware of in this environment. Encrypting connections: HTTP versus HTTPS. There are devils in the details with HTTPS, but we're doing a bad job of putting pressure on vendors and services to use HTTPS in the first place. I think our main websites should just go ahead and be HTTPS, so we get everybody in the habit of dealing with HTTPS.
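One way to audit your own pages for that HTTP/HTTPS mix is to flag plain-http subresources before the browser does. A crude sketch, with hypothetical URLs, using a regex rather than a full HTML parser:

```python
import re

def mixed_content_urls(html):
    """Return the http:// URLs referenced from src/href attributes in
    an HTML page. On an https:// page, these would trigger the
    browser's mixed-content warnings. Crude regex audit, not a parser."""
    return re.findall(r'(?:src|href)=["\'](http://[^"\']+)', html)

page = '<img src="http://cdn.example.edu/logo.png"><a href="https://ok.example.edu">ok</a>'
print(mixed_content_urls(page))  # -> ['http://cdn.example.edu/logo.png']
```

Run that across your templates and the "big blocks of nothing" problem mentioned below becomes a fixable to-do list instead of a support ticket.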
I think everybody's probably experienced the mixed content messages, or the nightmares where web pages have big blocks of nothing and your browser's unhappy because there's mixed content: some of it's HTTPS, some of it's HTTP. It seems like a really easy problem to solve. Just make everything HTTPS; then everything's secured. And that is all I have for you today. I'd love to take some questions and answer them if I can.