Welcome to this CNI session: privacy gaps in mediated library services, followed by some suggestions about what to do about them. I'm Micah Altman. I'm director of research for the newly named Center for Research on Equitable and Open Scholarship at MIT Libraries. This is joint work with Kit Haines, who's a student; Katie Zimmerman, a colleague at MIT Libraries; and Lisa Hinchliffe at the University of Illinois, who will present the "how to." I'll talk about some of the problems, and then my colleagues will talk about how to solve them. If you're interested in diving into the problems, this paper is not yet out, but a lot of related work on privacy is, and we'll have a white paper version of at least the first part by the beginning of the year. So, some brief background on libraries and privacy. Oddly enough, libraries have cared about privacy for a fairly long time, and this comes from at least two different sources. In the international tradition, reflected in IFLA's positions, privacy is viewed as a human right that is part of the Universal Declaration of Human Rights, with related rights attached. It's also a right implied by the right to access information and the right to free expression: if you can't access information in private, you may be subject to chilling effects, to retribution, and the like. The US has a different tradition of privacy. American libraries have recognized the right to privacy within the library context since 1939, based on its relationship to freedom of expression and freedom of access to ideas. Privacy and confidentiality are identified in the lists of core library values by ARL and by ALA, grounded in a professional value of supporting free inquiry. But unlike the international tradition, there's no general human right to privacy recognized in the US; the legal tradition and the rights tradition are more of a mosaic. So legal requirements for libraries are all over the place.
If you're serving an internationally located audience, which we increasingly do, whether or not we do it deliberately, we may be subject to laws such as the GDPR, the General Data Protection Regulation, which covers the EU and EU citizens and provides a host of rights for data protection and data privacy. If you're primarily serving US patrons, there is no equivalent of the GDPR. There are some protections for library patrons under 13 under COPPA, and the PATRIOT Act may require disclosure of information, on the other side. Many of us are in universities, which operate under FERPA, which provides privacy rights for students. State laws across the US also vary in their protections: a state may have separate information or data protection laws that apply to information collected in libraries, and/or specific library records laws, most of which keep the library's information protected from open-records requests. And here are some examples of state laws. We can also be under different contractual obligations: if we're processing financial information, PCI DSS protects personal data, and vendor contracts may as well. So there's a mixed background. There's a deep set of values that libraries have with respect to privacy, but a mixed set of formal law and regulation. And over time, our model of accessing information has changed, as has been noted. One of the implications of this change, as more and more content has become digitized or born digital, is that more content and more interactions are not taking place in the library itself. They may be taking place in a patron's office, a patron's home, or a third place. Patrons may not be interacting with content that is held and controlled by the library; it may be information that the library has acquired rights to and mediated access to, but that is housed at, and accessed from, a third party.
And even the systems that we use to manage interactions with our content and our patrons may be hosted in the cloud somewhere; we may not be running them ourselves. So we are interested in how privacy protection in libraries has changed, or maybe should change, in light of some of these shifts. The other part of the background, which explains where we're focusing, is that, again no surprise to most of you, in the last 30 years there's been considerable concentration in scholarly communications. A lot of the content that patrons of university libraries and research libraries are accessing comes from a handful of publishers, and much of that content they access through portals provided by those publishers. It may be mediated, bought, and enabled by libraries, but the interactions may occur elsewhere. So we did some research. We looked at vulnerabilities in two areas: first, what information can we see being collected? And second, what do publisher privacy policies say they can do with such information? How strong are the protections that they give in writing? It turned out that the first is a lot easier to do than the second. We're interested in whether there are systematic gaps. Are protections different for things that we keep versus things vendors keep? What practices are necessary? How are changes affecting things? Those are broader questions we may pursue. But here's an example scenario, and I use MIT because that's where I live. A patron might use the MIT library system, or part of its website, to discover a journal to which MIT has provided mediated access. The patron authenticates with their ID, and accesses the journal through an MIT proxy. Content is provided from a third-party website, perhaps through that proxy, which may lead to further discovery, exploration, and navigation, and content may be presented inside that site through a proprietary embedded reader. So what's potentially exposable here?
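The access path just described — patron, library proxy, publisher site — determines what the publisher's server can record. As a minimal sketch (all field names and values here are hypothetical), the server sees the proxy's IP, its own cookies, the requested path, and fingerprintable headers, but never the patron's name or institutional ID:

```python
# Minimal sketch (hypothetical field names) of what a publisher-side server
# can log when a patron arrives via a library proxy: no name or institutional
# ID is transmitted, but several linkable signals remain.

def publisher_log_entry(request: dict) -> dict:
    """Keep only the signals a publisher's server can observe directly."""
    return {
        "ip": request["source_ip"],                    # the proxy's IP, not the patron's
        "path": request["path"],                       # reveals reading behavior over time
        "cookies": request.get("cookies", {}),         # persist across visits to this domain
        "user_agent": request["headers"].get("User-Agent"),  # feeds fingerprinting
    }

# A request relayed by a (hypothetical) institutional proxy.
request = {
    "source_ip": "192.0.2.10",   # documentation-range address standing in for the proxy
    "path": "/journal/vol12/article3",
    "cookies": {"session": "abc123"},
    "headers": {"User-Agent": "Mozilla/5.0"},
}
entry = publisher_log_entry(request)

# Identity fields are simply absent from what the server receives.
assert "name" not in entry and "patron_id" not in entry
```

The point of the sketch is that the absence of a name in the log is not the same as anonymity: the fields that do survive are exactly the ones the next passage shows can be linked back to a person.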
Well, they typically don't get the patron's name and MIT ID outright; the patron is coming through a proxy, so they don't necessarily get that. But there's a lot they can get. There's an IP address, though that might be the IP of the proxy. There are cookies, though maybe those are local to the system. There is a browsing history, and a reading and sharing history if patrons log in to that reading app. There's also the ability to fingerprint the browser's characteristics. So without having a name and ID associated with it, from the information in that transaction, you may be able to figure out that this person is the same person you've seen in other places. And if you've also got their name from some other service over there, you may easily be able to link the two. And if you can do that, you can target advertisements. You could recommend things they'd like; that would be good. You could also attempt to insert more tracking technology into their browsers, so you can learn more about them. Now, we can't see what's going on at the publisher's site once the data goes to their server, but we can see what's happening in the browser. So we can see what tracking mechanisms are used there, though not everything that's being done with that information. And there are a number of different tracking mechanisms, or things that can tell you a little bit about tracking. HTTPS is good because it protects against third parties outside the connection trying to track you. But you can look for things like ad placements, loading of external resources, cookies, whether you got directed to a proprietary reader app that required a login, and other things like active attempts to fingerprint — think JavaScript that does active analysis of the browser to try to figure out who you are, which goes beyond collecting and passing information. And there are a number of tools that let you see this. Some of it, like ads popping up, you can see by visual inspection — the intraocular impact test.
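The fingerprinting just described works because passively observable attributes, hashed together, yield an identifier that is stable across unrelated sites even with no name attached. A sketch of the idea (the attribute names and values are illustrative, not a real fingerprinting library's API):

```python
import hashlib

def browser_fingerprint(attributes: dict) -> str:
    """Hash passively observable browser attributes into a stable pseudo-ID."""
    canonical = "|".join(f"{k}={attributes[k]}" for k in sorted(attributes))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

# The same browser observed on two unrelated sites...
on_publisher_site = {
    "user_agent": "Mozilla/5.0 (Macintosh)",
    "screen": "2560x1440",
    "timezone": "America/New_York",
    "fonts": "Arial,Garamond,Helvetica",
}
on_social_site = dict(on_publisher_site)

# ...produces the same identifier, so the two visits can be linked. If the
# other site knows your name, so, transitively, does this one.
assert browser_fingerprint(on_publisher_site) == browser_fingerprint(on_social_site)
```

Real fingerprinting scripts gather many more signals (canvas rendering, installed plugins, audio stack) and so achieve much higher uniqueness than this toy hash, but the linkage mechanism is the same.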
Privacy Badger is one plug-in that will alert you to some of these things. There are the DevTools in browsers, and the Brave browser is explicitly implemented to help you figure out when you're being tracked. We used all of these. In this chart, rows are different tracking technologies; white is good, black is bad — for encryption, white means it's on, while for everything else black means it's fully in use. Columns are different vendors. So this vendor is tracking a lot, and so is this one; these others, not so much. Now, there's a little weakness in the use of encryption. By the way, this one is Elsevier, and this one is ProQuest: there's a difference in the overall pattern. Do we see this difference in policy as well, in the legal protections? The traces show that vendors in some cases are trying very actively to collect information. How would we evaluate the strength of the legal protections that they're offering? This is a more complicated question than just looking at the tracking mechanisms. We identified a reference framework for the privacy principles and guarantees that, ideally, we're trying to measure. We took these, as you'll see, from the NISO privacy principles, from the ALA guidelines, and from some of the larger, more granular GDPR protections. These needed to be merged and de-duplicated, so we harmonized across the set, and we'll show you a little of that harmonization. We then developed the measurement instrument, which is basically a set of well-worded statements that we can evaluate, via agree or disagree, for how strong the guarantee of each particular protection is. And then we conducted repeated assessments, repeated over time. So we have some from before the GDPR privacy policies came in — which we won't discuss much here, but which will be in the paper — and some from after.
And we repeated across multiple vendors and across multiple raters, to make sure that we did consistent ratings, for inter-rater reliability. For the top-level framework, we've organized things around the NISO principles. After looking at the various frameworks, this turned out to be a good set of top-level categories: most of the requirements in the other frameworks and guidelines could fit, at a broad level, into the NISO principles. And they come out of the library community, so that's nice. There are 11 principles; I'm not going to enumerate all of them, but they include everything from accountability to security to education. And here's an example of a particular principle statement. At that level, they're not very detailed; they're broad statements. So we went to guidelines like the ALA privacy guidelines and privacy and security templates, and the GDPR requirements, which are stated in more detail — there are some examples — and we developed a mapping across them. So the NISO principle of shared privacy responsibilities contains several different protections. One of them is training: people in library systems and in data-controller organizations should have security and privacy training. We mapped that wording; in this case, there's an analog in a GDPR section and a rough analog in an ALA section. Rinse, lather, repeat, and we end up with approximately 40 different scoring areas. I won't bore you with all of them. At the top level, they fit quite well. The NISO principles are a bit more comprehensive than the ALA principles, and sometimes a bit more detailed. The GDPR goes down into more specific requirements, as you might expect; we go down to approximately the bottom level. And the GDPR goes beyond NISO in a couple of ways. One is to specify vulnerable populations beyond children, which is something that COPPA addresses.
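The harmonization can be pictured as a small data structure: each NISO top-level principle expands into sub-protections, each cross-referenced to its rough analogs in the more granular frameworks. This is an illustrative sketch, not the project's actual instrument; the statement wording, the ALA reference, and the GDPR article cited are examples only.

```python
# Illustrative sketch of the harmonized framework; the statement text and
# cross-references are examples, not the study's real instrument.
FRAMEWORK = {
    "Shared Privacy Responsibilities": [   # one NISO top-level principle
        {
            "protection": "staff security and privacy training",
            "statement": ("The policy states that staff handling patron data "
                          "receive security and privacy training."),
            # Rough analogs in the more granular frameworks:
            "analogs": {"GDPR": "Art. 39(1)(b)", "ALA": "guidelines on training"},
        },
        # ...rinse, lather, repeat for each sub-protection...
    ],
    # ...the remaining NISO principles, yielding ~40 scoring areas in total.
}

def num_scoring_areas(framework: dict) -> int:
    """Each sub-protection becomes one agree/disagree statement to score."""
    return sum(len(subs) for subs in framework.values())
```

Organizing the instrument this way makes the de-duplication explicit: a protection appears once, under one NISO principle, with pointers to its analogs rather than duplicate entries per framework.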
To do the evaluation, we harvested privacy policies from each vendor and froze them, so we would have a common copy, because they are changing all the time. We went through each sub-principle and designed a question around it by modifying the core of the NISO text and adding, if necessary, specific additional protections from more specific regulations like the GDPR. We converted that to a statement that you can agree or disagree with. The idea is that we don't know how well the vendor actually protects these things; we are evaluating what the policy promises — whether the policy describes what they're doing on an issue, or promises to protect it in a particular way. How it's actually protected — this is not an audit. For any statement that you can agree or disagree with, we use a standard five-point scale; it's called a Likert scale, and it has some nice properties. And then we had independent coders rate these and compared for inter-rater reliability — which we are still iterating on, but the initial ratings are quite good; we will refine it a bit. So we've gone through two rounds: based on a previous, less detailed version of the instrument, we looked at what might be the emerging best and worst players, and then used this revised instrument, with more detail, with multiple raters. One is the best score that you can get, and five is the worst — five is "strongly disagree that they met this criterion." ProQuest gets a solid B. That's very scientific, but — the three is science, the B is art. Elsevier gets approximately a D minus. Of course you can tell the fine detail here: the red is Elsevier, the blue is ProQuest, and higher is worse. Generally, ProQuest is doing better on almost everything; there are a couple of things that they just forgot altogether, which is why they get fives.
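Inter-rater reliability for ratings like these is often summarized with Cohen's kappa, which corrects raw agreement for the agreement two raters would reach by chance. Here is a minimal sketch with made-up Likert scores (the talk doesn't specify which statistic the project used; plain kappa also treats the 1-5 scale as nominal, whereas a weighted variant would give partial credit for near-misses):

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same category at random.
    expected = sum(counts_a[c] * counts_b[c]
                   for c in set(rater_a) | set(rater_b)) / n ** 2
    return (observed - expected) / (1 - expected)

# Two raters scoring eight policy statements on the 1-5 Likert scale.
rater_a = [1, 2, 2, 5, 4, 1, 3, 2]
rater_b = [1, 2, 3, 5, 4, 1, 3, 2]
kappa = cohens_kappa(rater_a, rater_b)   # 0.84: substantial agreement
```

Values above roughly 0.8 are conventionally read as strong agreement, which is the kind of check that justifies saying "the initial ratings are quite good."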
And my guess is that if ProQuest had a bit more detail describing some of those areas, their score would be even better, and the positive difference would be bigger. But Elsevier — the reason they didn't get straight fives is that they're pretty good about telling you they're going to collect everything. They comprehensively tell you the information they're going to collect, which is everything: not just everything you give them, but they're going to go and buy information about you from third parties, and they're going to instrument your browser and try to track you. They tell you all that. And they're pretty clear about what they're going to do with it, which is anything they like, including defending their intellectual property rights against third parties or against you. So they do get points for disclosure, but points off for not actually protecting the information in the way that our library principles and values would call for. Here's a drill-in on that "anything and everything." And here's a snippet of the ProQuest privacy policy. It's better in a number of ways. First of all, the language is much simpler and easier to understand, and they do drill into the detail of what they do later in the document. They frame it as your rights over the data, versus "we're collecting everything" and a description of a data-monetization approach. So there's a lot to like here. There are also some blind spots and a lack of detail in different areas, but it is significantly better. What about the GDPR? Well, we did this analysis on the post-GDPR policies; those are the ones you saw. We also looked at the pre-GDPR policies, but we haven't gone back and applied the revised instrument with multiple raters to those.
But the post-GDPR Elsevier policy is still lousy, like its predecessor. And one of the reasons is that laws can change, and that drives changes in compliance, but it doesn't necessarily drive changes in protection — protection for you. So Elsevier says you have the right, under European and certain other privacy and data protection laws as may be applicable, to request access to, correction of, and deletion of personal information, restriction of processing, objection to processing, and portability. So you can request it. They don't have an interface to actually control it — it's not like the Google data management tools, where you can turn off data collection and take your SMS texts and move them somewhere else — but you can at least request it, if you're a European citizen and if they decide the law applies. So they're doing what is necessary for compliance, and your rights have increased if you're subject to the GDPR; you have more rights now. But this is a narrowly tailored expansion. Now, some discussion and recommendations. A summary of what we found, descriptively: increasingly, there's a misalignment between stated library values in privacy and data protection, and privacy practices. Why is that? Because many of the ways that people are accessing content are ways that we don't control directly, so that has been something of a blind spot. Broad data collection, assertions of broad use, and broad and active tracking are very common in this space. And open access doesn't protect you: if open access content is on a publisher site that has bad practices, they're applying those tracking practices on the OA site as well. Some publishers do better than others. Content portals like EBSCO and ProQuest are doing less tracking, even for content that came from an original publisher that tracks more. So it matters which route you come through.
And Elsevier appears to be the most invasive in terms of their practices. There is a difference between law and your legal protection, and the laws currently provide inconsistent protection because there's a patchwork of regulations that apply. Licenses provide little additional protection, in part because much of this data is not passed from the library to the vendor — in which case some of these riders might help — but is collected by the vendor itself as part of their own processes, even though the library may have paid for the service and directed the user to it, branded it, and so forth. And the licenses are not built top-down from privacy values and principles as we now understand them. So there are a number of areas, which my colleagues will talk about, in which licenses could be improved, and this may be low-hanging fruit: making them less opaque, for one. Patrons don't even know what we've negotiated for them; all they generally see is the publisher click-throughs. Also, current licenses are inconsistent, which makes them harder to interoperate and harder for patrons to understand. And current licenses don't support evaluation: it may be difficult to evaluate compliance, or even to get back the data that is being collected, to see how users are interacting with those sites and whether there's some issue, or whether they're opting in or not. And just as a baseline, a couple of years ago Marshall Breeding did a quick look at privacy practices across ARL libraries; library-hosted systems probably aren't doing so well either, though we're not actually tracking those here. There may be opportunities for designing privacy mechanisms into the way that we interact with vendors. Right now, there's no support for privacy-protected machine learning: if you want a recommendation system, you have to give up your personal data.
There are ways to give good recommendations without revealing your personal information — things like differential privacy, or cryptographic tools for that. Similarly, there are ways to access gated resources without having to provide your identity. So there may be mechanisms that would enable users to protect themselves, or libraries to protect users more, but they would need to be architected into the system. And then there's a question about awareness, consent, and control. Even when a user's university has a contract with a vendor that provides some protections for data collected from users, that might not be known to the patron, so they can't act accordingly. Even if they know there's such a policy, often they can't view it; the policy might be unavailable, and the contract might even be proprietary — you're not allowed to share your contract with others. And it's almost certainly not presented to the patron when they're accessing the vendor's systems. And there may be conflicts: if the library contract says you should treat the data just as we would, but the terms of service say click through and you agree that Elsevier can do whatever it wants, and patrons click through because it's the vendor's site — which wins? It's not clear. Probably, in the absence of anything else — in the absence of better license terms — it's the vendor policy that wins. So these contracts are not designed in a way that makes them effective for either understanding or control in this environment. So: patrons access content that is purchased, mediated, even branded by research institutions and libraries, but it may not be protected by the policies that we would like, or that we would have in place if their access were solely through our own systems. And we need to design in some protections; not everything can be bolted on afterwards.
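One of those techniques can be sketched concretely. Under differential privacy, a recommender can rank items by usage counts to which calibrated Laplace noise has been added, so that no individual patron's reading measurably changes the output. This is a toy sketch of the standard Laplace mechanism, not a library-grade implementation; the item names and the epsilon value are illustrative.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from Laplace(0, scale) via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def private_counts(counts: dict, epsilon: float) -> dict:
    """Add Laplace(1/epsilon) noise to each count (counting queries have sensitivity 1)."""
    return {item: c + laplace_noise(1.0 / epsilon) for item, c in counts.items()}

def recommend(counts: dict, epsilon: float, k: int = 2) -> list:
    """Top-k items ranked by noisy, privacy-protected counts."""
    noisy = private_counts(counts, epsilon)
    return sorted(noisy, key=noisy.get, reverse=True)[:k]

random.seed(0)  # for reproducibility of this sketch
usage = {"journal-a": 120, "journal-b": 95, "journal-c": 7}
picks = recommend(usage, epsilon=1.0)
```

The design choice is that privacy is enforced before the data leaves the aggregation step: the raw per-patron counts never need to be shared with the recommendation consumer at all, which is exactly the "architected in, not bolted on" property the talk calls for.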
There are opportunities to improve accessibility of services for human and machine clients, and to explicitly support affordances for privacy — like anonymous-but-authenticated access, or privacy-preserving recommendation systems — that would enable more ways of interacting than simply going through the publisher's UI and being subject to all of the tracking mechanisms there. And we very much need a standardized model community license: one that's aligned to community principles, that's transparent and verifiable, that is consistent for all patrons, whether they're in Europe or coming from Arkansas, and that is also understandable, so that people don't have to think, "Well, am I in the EU zone? Do I care?" — because these can be very long and difficult to understand. So those are all the problems. There are some references, and now we'll solve everything for you. Hi everyone, I'm Lisa Janicke Hinchliffe. I'm the coordinator for information literacy services and instruction at the University of Illinois at Urbana-Champaign. This session that we proposed, and then kind of got combined together, turned out fortuitously, I think, because one of the ideas that came out of the National Forum on Web Privacy and Web Analytics, held earlier this year at Montana State University — and I want to pause for a moment and give credit to Scott Young, the PI on this project with his colleagues at Montana State, who brought us together — lots of good ideas came out of that symposium, but this particular idea seems actionable in the sense of being very concrete: perhaps, if we don't like what we are finding happening under the contracts that we negotiate, we should negotiate for something else. And so — go ahead and click — I'm going to have Katie talk for a little bit, and then we're hoping to have a bit of a discussion. Okay, so, hi. I'm Katie Zimmerman.
I'm a licensing librarian at MIT, on the scholarly communications side, so I negotiate a lot of these licenses for MIT. I also happen to have a law degree; people seem to find that important. So our proposal, which we're hoping to get some input on today, is to develop and maintain a system of model license language for user privacy — to address these concerns so that we're not relying on vendor privacy policies, which change all the time and, as we've seen, are not terribly effective. If we don't like what they're doing, we should do it ourselves, and we should do it better. As Micah mentioned, this seems like it should be low-hanging fruit. This is something that the library community has done before, and done well: we have LibLicense, we have other model licenses. We have the expertise to do this, so we really should. It facilitates communication and improves the efficiency of negotiations: if there's something with community consensus already built up, you can just say, "Hey, can you adopt this?" — and that makes it much easier to actually get it in there. It also makes it easier for the vendors to comply if everyone's asking for the same thing. If you're asking them to put in tech work and do a specific one-off thing just for you, you're probably not going to get that; but if everyone's asking for the same thing, then we can actually get some change. And it provides a consistent message: it's a way to let vendors know that this is important to libraries and that this is what we want going forward. So, starting points. The starting points are largely what Micah talked about: the NISO framework, the ALA privacy guidelines, the GDPR, the privacy framework from this project, and the existing model privacy language. And I do want to acknowledge that there are already user privacy clauses in most of the existing model licenses — I actually haven't looked at them all, but maybe in all of them.
But there are still some gaps. The LibLicense language, for example, is pretty good at restricting vendors from passing data to third parties, but it doesn't really address what the vendor does itself. So Elsevier can't outsource the analytics, but they can do whatever analytics they want on you themselves — which they probably are. Still, that is a basis we can work from. Next slide. As for the areas of consideration and components, I'm not going to dwell on this too much, because I think it's more important to talk with you. But the things that have been identified in the project I'm working on with Micah are the things we should be addressing in this language. So this came out, as I said, of the forum at Montana State, but one of the big takeaways was that Lisa and Katie probably shouldn't just sit down and write the model license language by themselves — especially Lisa. So part of the question is this: a bunch of us, five of us, decided this was our passion project when we were in a room for two days, but we wanted to get some feedback from the larger community on the desirability of this, the feasibility, and then what it would take to actually operationalize creating this kind of model language. Where does it get hosted? How do we do this as a community? Though we have other examples, we don't want to assume that just because somebody is hosting some other model license language, they'd like to host this too. So this is really the open time for comments, questions, problem-posing, thoughts. I think literally — yeah, that's it. Thank you, everyone. Thank you.