So there are four of us who are going to talk today, and what we're going to do is address different issues in privacy and data security. The session is Libraries and User Privacy. I'm Peter Brantley at New York Public Library, and with me are Gary Price at infoDOCKET, Eric Hellman at Gluejar, and Marshall Breeding at Library Technology Guides. I'm going to do a brief intro, and then we're going to have, let me see if I can get the order, Eric, Gary and Marshall in succession. I'm going to give a background discussion of some of the major facets that we're concerned about, and then we're going to start drilling into some of the issues relating to libraries and privacy. This became a greater issue than it had been because in late summer or early fall there was a now well-known incident where it was discovered that Adobe Digital Editions was transmitting information about reader collections back to Adobe, and not only were they doing that, they were doing it in the clear, which of course is how people found out about it. So aside from the mind-numbingly stupid engineering at one of the partners that we do business with on a daily basis, I think the thing that was most shocking about this was the realization that there's a tremendous amount of information about what we read and what we consume, on our machines and on the network, that winds up being transmitted elsewhere. So this was something that a lot of us started paying attention to. This particular slide is from a friend of mine, Liza Daly, who's the VP of Engineering at Safari. She was very instrumental in helping us understand the contours of this particular data leakage, as were Eric Hellman and a few others. There are a couple of things I want to underline as foundations.
One is that increasingly, as library services start moving to the cloud, and as many of us experienced through the opening discussion of this conference, there's almost an actual urge or momentum to move our platforms into third-party services. And as that happens, and as we get more skilled at manipulating open web software and designing solutions that take advantage of the opportunities that open web software and hosted services provide, there are cases where our needs as organizations are not always going to align with our users' needs, either for defensive reasons, which could range from liability to network security for the organization, or for proactive ones, where we might want to design services that our users might not want to take advantage of because they require data gathering. The other clarification I want to make, a very important one, is that privacy is not security. These are very different concepts. They are increasingly melding in really important ways, but it's important to distinguish them in your heads. Privacy, at the most basic cut, is about what information is known about particular users, with whom it's shared, and whether or not you've given permission to release that information to others. When we think about security, we think about traditional IT-centric aspects of how we govern our organization's network presence. So we think about things like defense in depth, intrusion detection, and contingency and response protocols. And we also think about very discrete tools and services that we use to protect our networks: things like data encryption, wire protection, segmenting our networks, and firewalls. These are very specific things that we do to protect the security of information that flows within and through the networks that we operate.
Now raw data security, I think, had not been that much of a pressing issue for libraries in particular, despite the continual drumroll of data breaches that surrounds us in the news, except that very recently there was a hack against the Wyoming state library system, potentially by external, out-of-state intruders, that breached their online catalog. Now on one hand, the data that's breached in an online catalog is not necessarily the most critical personal information. It's often purely directory information and as such might even be publicly available in many cases. We are pretty good at throwing away circulation records, so those would not necessarily have been exposed in this case. But there are a couple of things to note. First of all, the appearance of this is obviously very poor for libraries. The last thing that I want is to wake up and discover in the New York Times that New York Public Library's patron database has been exposed and is now available for download as a torrent. Not exactly the kind of Good Morning America that you want to see when you're working in a large institution. The other reason, and one that Eric may touch on briefly, is that pieces of information cannot be considered in isolation. Information can be joined together and merged to learn more about individuals than might otherwise be apparent from just one data breach in isolation. So this combination of information creates a small-numbers, or actual identification, problem for our library users. So this concept of personally identifiable information is not just the information that might be held within any one particular database or any one particular network service; we really need to think about what PII might be generated across the services that individuals have access to. And personal information is really quite ubiquitous, in ways that we often don't think about.
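To make the point about joining information concrete, here is a minimal Python sketch of how two datasets that each look harmless can be linked. Everything in it is invented for illustration: the card numbers, names, field names and the `link_records` helper are all assumptions, not data or code from any real breach or library system.

```python
# Hypothetical "directory-only" rows from a breached catalog (made-up data).
catalog_breach = [
    {"card_id": "P-1001", "zip": "10018", "birth_year": 1975},
    {"card_id": "P-1002", "zip": "10027", "birth_year": 1990},
    {"card_id": "P-1003", "zip": "10003", "birth_year": 1982},
]

# Hypothetical rows from some unrelated public or leaked dataset (made-up data).
other_dataset = [
    {"name": "Alex Example", "zip": "10018", "birth_year": 1975},
    {"name": "Sam Sample", "zip": "10027", "birth_year": 1990},
    {"name": "Pat Placeholder", "zip": "10027", "birth_year": 1990},
]


def link_records(left, right, keys):
    """Join two datasets on shared quasi-identifiers.

    Only unambiguous matches are returned: a row in `left` is linked
    when exactly one row in `right` has the same values for `keys`.
    """
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    linked = []
    for row in left:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a "small numbers" hit: one person fits
            linked.append({**row, **candidates[0]})
    return linked


linked = link_records(catalog_breach, other_dataset, ["zip", "birth_year"])
for row in linked:
    print(row["card_id"], "is probably", row["name"])
```

In this toy data only one card number links to a single name; the second is ambiguous and the third has no match. The more fields two datasets share, the fewer people fit each combination, which is the small-numbers problem in miniature.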
For example, in a lot of urban libraries there are video cameras aimed at least at the circulation desk to protect the staff there. This is a fairly common thing, but those video cameras are now digital cameras, and so they're recording data that could itself be subject to digital analysis like facial recognition, and potentially available for legal pursuit if that material is not regularly purged. We all know as IT people that we generate a tremendous amount of information in logs, and we are also cognizant that, despite our best efforts to eliminate it, that information is inevitably retained in backup copies. Increasingly we design mobile apps that take advantage of common services and let individuals log in through social media networks. These kinds of logins leak a tremendous amount of data into the networks used by the social media applications. And geolocation software libraries are nearly ubiquitous at this point, and as a consequence we at NYPL and many other libraries are interested in working with these services to provide better assistance to patrons wherever they happen to be in cities. So in short, libraries are increasingly working in an environment where there's not only a great deal of data transmission, but where we actually want, for our own purposes, to gather as much information about patrons as we reasonably can, to create services that serve them in ways that would not have been possible in older or more traditional libraries. And in turn that orients a large urban library, and certainly, historically, research libraries, toward single sign-on solutions, which enable the ubiquitous collection of information about patrons and users of our services regardless of what kinds of applications they run across, whether it's a donation system or a journal application or an e-book platform.
So in the current library we have these conflicting aims: trying to figure out how best to protect the data that users generate, keep it safe internally, and use it responsibly for generating new services, while trying to ensure that as little data as possible flows outbound without the user's understanding and acknowledgement that this is indeed happening. It's inevitable that there's going to be some data leakage through our networks, and I think that if we pretend that libraries are just going to be the last, best safe refuge, we're really doing a disservice to our users and to ourselves as well. We really need to educate our communities and our users about the way networks work and about the kinds of information that flow through them. So I'll end my opening here with this slide, which was a dialogue between Eric and Liza, the VP of Engineering at Safari, coming on the heels of the Adobe Digital Editions exposure. Eric had highlighted that even in Safari, which is a very reader-conscious system, there were cookies that enabled certain kinds of user tracking, and that Liza, as VP of Engineering, had not considered their full ramifications. Liza responds immediately and very responsibly that she'll restrict the scope of those cookies and try to prevent the kinds of actions that might occur. And Eric closes with, "I've yet to find anyone who understands the privacy implications of their own websites," himself included. When we turn our eyes to the services that we're engendering in research libraries and public libraries, I think this is very much a true statement, and it behooves us to acknowledge that and to have the most informed dialogue possible with our user community. So with that intro I'm going to hand it over to Eric, who will dive into these issues in much greater detail.
So for the last couple of years I've been working on Unglue.it, and what we are trying to do is make the world safe for free ebooks. Part of that mission has been to make Unglue.it safe for libraries. I had a librarian working for me for two years, Andromeda Yelton. She was basically fresh out of library school, and she really sensitized me to the importance that libraries, at least in theory, put on protecting user privacy. But in this transition to digital information, and especially the transition from print books to ebooks, a lot of that commitment to user privacy has to be talked about in the past tense. So why do I show you my Facebook page? Well, just last week, on Thursday, Facebook, whose business, by the way, is to be an advertising network. You may think of them as a social network, but their real business is of course to sell you advertising. Facebook decided to show me an advertisement for this book, which you probably can't read; it is The Architecture of Open Source Applications, which we had just added to Unglue.it. That's the ad for The Architecture of Open Source Applications. Now, the reason that Facebook wanted to show me this ad and try to get me to buy this free ebook was that I had just visited lulu.com, where the print version of this book was for sale. I seriously do not know why Facebook thought it would be a good idea to show me an ad for 50 Shades the musical. So here, let me show you the web page at lulu.com for The Architecture of Open Source Applications. If you look carefully in the upper right-hand corner, you can see a little icon for an add-on that I added to my Chrome browser called Ghostery. What Ghostery does is examine the web page that I am looking at and tell me all of the tracking services that are tracking me as I read this web page. Ghostery has found, I think, 30 or 31 trackers that are tracking me.
And I hope by the end of this talk you will have resolved to go home, install Ghostery in your browser, and start looking around and becoming aware of all of the advertising networks that are tracking you as you go to different websites, and especially as you go to your own institution's websites. Ghostery will also give you a report on each of the tracking services. This particular one starts with AppNexus, which is one of the largest advertising networks, then Burst Media, another tracking network, and on down the list, 31 of these things. So part of what lulu.com is making money at is providing services to help advertising networks like Facebook sell you the stuff that they have on their website. If you use Chrome, you can turn on Chrome developer tools, which will tell you all the requests that your browser makes to build this web page. And so this is the request that it's making to Facebook. Way too small for you to see, sorry about that. It tells you all of the information that Lulu is telling Facebook about you. It sends the request directly to Facebook and the Facebook Connect advertising network. It gives the full address of the page that I'm on at lulu.com, which usually, for reasons of search engine optimization, contains the full name of the book that I'm looking at. That's the referrer header. And it also sends a bunch of tracking cookies. The tracking cookie is set by Facebook, and whenever I go to a website with a Facebook tracker on it, my browser reports to Facebook whatever is in this tracking cookie. This particular tracking cookie identifies me as an individual user of Facebook. It gives information about other places that I've been. Apparently I had been on the Ted Cruz website, so I'm telling Facebook that too. And by the way, my browser also sends the Do Not Track header to Facebook. The Do Not Track header tells Facebook that I do not want to be tracked. Facebook laughs at my innocent naivete about advertising networks and tracks me anyway.
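As a rough illustration of the request just described, the following Python sketch assembles the text of the kind of HTTP request a browser sends to a third-party tracker embedded in a page. The host names, the cookie value and the `third_party_request` helper are hypothetical; only the header names (`Referer`, `Cookie`, `DNT`) are real HTTP headers. Nothing is sent over the network.

```python
from urllib.parse import urlsplit


def third_party_request(page_url, tracker_url, cookie, do_not_track=True):
    """Return the raw text of a GET request to an embedded tracker.

    page_url    -- the page the reader is actually looking at
    tracker_url -- the third-party resource embedded in that page
    cookie      -- whatever the tracker previously stored in the browser
    """
    parts = urlsplit(tracker_url)
    lines = [
        f"GET {parts.path or '/'} HTTP/1.1",
        f"Host: {parts.netloc}",
        # The full address of the page being read, book title and all.
        f"Referer: {page_url}",
        # The identifier that ties this visit to every earlier one.
        f"Cookie: {cookie}",
    ]
    if do_not_track:
        # A request, not an enforcement mechanism: the server may ignore it.
        lines.append("DNT: 1")
    return "\r\n".join(lines) + "\r\n\r\n"


example = third_party_request(
    page_url="https://bookstore.example/shop/the-architecture-of-open-source-applications",
    tracker_url="https://tracker.example/pixel.gif",
    cookie="uid=abc123",  # hypothetical tracking identifier
)
print(example)
```

The point of the sketch is that the tracker never has to ask the bookstore for anything: the browser volunteers the page address and the identifying cookie on every embedded request, and the Do Not Track line is only a preference that the receiving server is free to disregard.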
Not that the library world is all that much different from a commercial service like lulu.com. This is the Ghostery report for WorldCat, for exactly the same book's page in WorldCat. WorldCat sends all this similar sort of information to AddThis and to Omniture, which is part of Adobe and does various tracking to help websites, including many organizations other than OCLC, optimize their delivery of content. But I don't want to give you the impression that OCLC is an outlier in this by any means. I was going to put up New York Public Library, but since I made a presentation to them a couple of months ago they have removed a bunch of the tracking services that were on their website. If I go to my local public library catalog, this is the Bergen County Cooperative Library System, or BCCLS. They also use this service called AddThis, which reports all my browsing information at BCCLS to their advertising networks. They use a Polaris catalog, but again, it's not just one catalog vendor or one website. Let me tell you a little bit about AddThis, since I've mentioned it for both OCLC and BCCLS. What they do is provide share widgets. Those are the things that you click on a web page to share your behavior with Facebook or Twitter or any of the other social networks. Because they do this, they know who people are: if you have ever used any of the AddThis buttons to post anything, then you have just given AddThis your personal information and identity on these social networks.
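For anyone who wants a first look at their own catalog pages without installing anything, here is a minimal Python sketch that lists the outside hosts a page's static HTML pulls resources from. It is an assumption-laden simplification, not a substitute for a tool like Ghostery: the sample page and host names are invented, the `third_party_hosts` helper is hypothetical, it treats any different host name (including subdomains) as third party, and it cannot see trackers that scripts load after the page renders.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class ResourceCollector(HTMLParser):
    """Collect URLs of scripts, images, frames and linked stylesheets."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "img", "iframe") and attrs.get("src"):
            self.urls.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            self.urls.append(attrs["href"])


def third_party_hosts(html, page_url):
    """Return the sorted host names, other than the page's own, that the page loads from."""
    collector = ResourceCollector()
    collector.feed(html)
    first_party = urlsplit(page_url).hostname
    hosts = set()
    for url in collector.urls:
        # urljoin resolves relative and protocol-relative ("//host/...") URLs.
        host = urlsplit(urljoin(page_url, url)).hostname
        if host and host != first_party:
            hosts.add(host)
    return sorted(hosts)


# Invented catalog page for illustration only.
SAMPLE_PAGE = """
<html><head>
<link rel="stylesheet" href="/static/catalog.css">
<script src="//widgets.example/share.js"></script>
</head><body>
<img src="https://analytics.example/pixel.gif?page=record-123">
<script src="/static/search.js"></script>
</body></html>
"""

hosts = third_party_hosts(SAMPLE_PAGE, "https://catalog.example/record/123")
print(hosts)
```

Each host in the resulting list receives a request from every reader's browser, along with whatever referrer and cookie information the browser attaches, so the list is a reasonable starting point for asking which of those parties a library actually intends to inform.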
They're also notorious for having deployed canvas networking, no, canvas fingerprinting, which is a technology that allows them to track you even if you erase all their tracking cookies. Even if you turn off cookies completely, they can still track you, and they've used this technology to respawn cookies, so that if you delete the cookie they track you across deletions. And of course you can find them on 5,000 of the top 100,000 most popular websites. They of course have a privacy policy. Their privacy policy tells you that they set cookies for their partners to help them track you on the internet and to allow their partners to deliver targeted advertising. Well, so what? We live in a post-privacy age, to some extent. But I want to remind you of the story about the six blind men who, when presented with an elephant, each came back with a completely different description. One of them felt the tail and thought the elephant was sort of a rope-like thing, another felt the trunk and thought it was more like a snake, another felt the ear and felt it was sort of like paper, and another felt the elephant's stomach and thought it was a wall. It was only when they started talking among each other that they realized that they were all measuring aspects of a single identity. Once they were able to put that all together, they were able to have a picture of what the whole elephant was about. And that's exactly what tracking networks do. Now, a library that's reporting information about web pages is not really reporting personally identifiable information about a user. They're just saying what books a person is reading or interested in reading. But because these advertising networks place their beacons across the entire internet, each of them reporting back the identity of the user that is making all these requests, the advertising network can determine the social identity; it can determine what the person is buying,
who their friends are, what contacts they make, what websites they visit, and geodata that tells them where these people are. And so by putting all of these pieces of information together, the advertising networks assemble a portfolio of information about people that makes it easy for them to advertise whatever they're selling. Great. Well, there are lots of advertising networks. This is a list of advertising networks that are active out there. Some are bigger than others, and some of them are specialty networks, but they all basically engage in a business where they are oriented toward collecting data about users and helping advertisers sell stuff. I don't have the last slide, because this was a different presentation, but just to tell you, I've been blogging about this at my blog, Go To Hellman, and there's just all sorts of stuff going on. I would ask you to please think about what your websites are doing, and not betray the substantial trust that users are placing in you to use their personal information responsibly. Thanks.

Good afternoon, and thank you for coming. There's a lot to share with you in a very short amount of time. Just to wet your whistle, and to continue on from what Eric was speaking about: what you're seeing right now is literally a live stream of all the internet traffic going through the router in this room, and throughout the hotel, on this particular network. I am not a hacker. I am not even a developer. I'm just somebody who's been very interested in this topic for probably about five years now. On a scale of one to ten in using this particular tool, which is called Wireshark, I am probably a two or three. I'll go to some in-depth pictures using some of my personal searches in a moment, but what you're seeing here is a live stream where, with a little bit of knowledge, and there's a lot of documentation, you can get down to what is actually being sent over the network. I see several people, Roger and others, taking
pictures, so I'll make it nice and big for you. You're welcome, Roger. So that's that. Now go here. Also, to follow up on what Eric was saying so nicely, what you're seeing here, once we close this out, is something called Lightbeam, and this works for Firefox. You're seeing all the websites and the third-party cookies and third-party trackers that Eric was talking about, visualized using triangles and circles. Circles are sites that I visit on this computer. So here's my own site, infoDOCKET, and a script set a cookie on my site because I embedded something. Here, for example, is a YouTube that set a cookie to restaurant org for some reason. The big circle here is Google. Google, for example, was informed of my visit to this wincap org site, and so on and so forth. So you can see, for example, this is ajax.googleapis.com and this site was size server.org, so googleapis.com was informed that I was at size server.org in order to get the necessary data. As for what each Google cookie is and what it means: as you probably also know, under EU law sites actually have to tell you that they're using cookies, and I don't believe that a government mandate is needed in the United States and North America for libraries to do more, to share more, and to be more transparent. The library community has wanted everybody else who has data to become as transparent as possible. We need to do the same. Let's do another quick live demo before we get to the next speaker. So we saw Wireshark. Is there anybody from Iowa State? No? This is a tool called Cookie Cadger, and what this is doing is collecting all the cookies that are flying over the internet on a particular router or through Ethernet. As you can see here, for every person on the network, whether you're on a mobile device or a laptop, every device has a MAC address, a unique identifier, and this will actually pull that out. Of
course, you could do it with Wireshark, but this makes it a lot easier. So here's the MAC address for my computer. This identifies me, and now you can see all the non-encrypted data that goes from my particular computer over the internet. Non-encrypted. So just to give you an idea, and to save time here, I did a search, with Peter's permission, of New York Public Library's OPAC. By the way, as you can see when I cursor over, I also know the type of browser the person is using, and I know some other technical details about the computer. But as you can see here, I can double-click on this and I can actually see the search, the actual search terms that I used searching NYPL.org. Marshall will get into more about other vendors, but this is the case with a lot of database and OPAC vendors: it is going over the internet in the clear. So, as Peter said at the beginning, if the Adobe issue was, correctly, an issue for the library community, this also should be as much of an issue, if not more, because it's happening all day, every day, not only from OPAC vendors but also from database vendors. So that's an idea of what it looks like. A lot of times, too, there is personal search information that's sent through the different analytics services. Let me go up here to a couple of slides, and I'll finish off with these. One thing to know: Google Books, for example. The Google Books search is encrypted as HTTPS, but when you actually click on a Google Books entry, it's not. So here you can see that I was looking at the front cover of a book, and the search that I did to get to the book was the Chicago Cubs, my lonely hometown baseball team. So the search is encrypted, but the actual Google Books entry page is not. Here's another example of that. I did a search, and there it is: Flintstone. When you saw me scrolling earlier in my presentation through Wireshark, very often devices are named for the
actual person using it. It's often set that way by default. So on my iPhone this morning it was set up with my name, Gary. Just to show you, I then changed it, and I became Fred Flintstone. So for those of you who are using different types of devices, your actual name, sometimes first name, sometimes last name, sometimes both, is being picked up by different sniffing tools like the ones I was showing you. And then here's another example, from a specific vendor. You can see the DOI is being transmitted, so now you could find out the particular article that I was looking at, because, again, the magic MAC address is unique to a specific device, and so to a specific person. And then finally, I was looking at a large, well-known vendor of legal databases. I was working yesterday in a large coffee house, and somebody there was actually searching. I blanked it out, but I could see what legal database they were searching, and from that particular vendor, in some cases, I was able to see the client ID number. And then finally, here's Google Analytics, in this case from an academic library at a .edu. Through Google Analytics I can actually see the search somebody was doing. So all of this is out there, and I think that, as you've heard, we all need to do more to inform our users and to work with vendors to make this data more difficult to access. Thank you.

So my part of the panel is to talk about the state of the art, or the state of practice, when it comes to the way that basic library automation and discovery systems treat library privacy and data, at least at the network and security level. Some of these questions have to do with privacy and some have to do with security. I think they're pretty tightly interlinked in a lot of ways. It's incredibly startling to think that anything that you transmit from your
library's web page over the internet, whether it's wireless or wired, is kind of out there for public inspection. So as we think about the privacy that we provide for our patrons, our community members, through our systems, it's important to pay close attention to how well we guard that information. So I did a little mini study for this, as I often do. I picked a handful of the major library automation vendors, and I'm really delighted that they're always able to tell me things about their systems, which I then also verify through my own knowledge and through other folks who use those systems. So I invited this cast of providers to respond to a number of questions that I came up with that target the question in mind. Think of this as just an introductory study for the conversation that I think we should have. I plan to do a lot more in-depth investigation of the topic, so I don't want you to think that I've already done a comprehensive study. There's a lot more left to be done, and this only looks at one piece of the puzzle. We're mostly looking at the patron-facing interfaces, online catalogs and discovery systems, and then at the basic handling of the way that libraries manage and store the data. So I sent them about 25 questions, I guess, in these different categories. The ones I think are most interesting and important are the first set, having to do with patron-facing interactions, and then, because staff networks are vulnerable as well, the staff clients, and then whether data stored in library systems is encrypted or not. So if somebody does manage to compromise a system and get the patron file, is it going to do them any good? Well, yeah, it is, if it's just all in the clear. Or is it encrypted and very difficult to break? So those are the kinds of questions that I asked, and I even floated the idea: do we need to have some kind of security
compliance framework, so that library providers and library users can agree on a standard for the way that data are going to be treated within any of these systems? I think that the gold standard is that we encrypt everything, but library catalogs come from a longer history where that wasn't considered necessary. We know that we have to encrypt things like the handshake that you do for authentication. Nobody would have a user sign-on that didn't have some kind of encryption for that process, and some kind of hashed password, a salted hash, in the way that it's stored, just as you gain access to the system. Some only do that, and some do more. So it's interesting that there are very few that have comprehensive encryption, and it's enlightening to think that the one that does it is one of the smallest vendors. There's this vendor called Biblionix that has a public library system called Apollo. It's a relatively recent multi-tenant software-as-a-service offering, but they encrypt every page that they deliver to both their staff and their patrons, and as far as I know, among the ones that we're talking about here, none of the others are doing that. But when you go to Facebook, when you go to Google, they have gotten to the mode of encrypting everything, and I think that's what libraries have to get to as well. BiblioCommons says that they're going to be in that mode in 2015. But pretty much all of the others say: we think about whether it has patron details, whether it has sign-on information, and yeah, we'll encrypt those, because those are the important ones, those are the sensitive ones. Well, you're right, you don't want to expose a phone number, Social Security number, PIN or password, so we encrypt those. But the entire stream of a search describes what that community member is searching for, what they looked at, and what they read. So in the same way that we wouldn't turn over circulation records to anybody
who asks, you wouldn't expose all of that on the network in the clear. So I think that's where we're getting to now: the idea that the whole thing has to be encrypted if we really want to keep patron privacy on the network these days. So some are selective, and then some are all or nothing. As best as I can tell, with the open source systems you can say, well, we will encrypt everything or we won't, but it's left up to the library or the provider for that library. And my observation is that in practice it's mostly in the clear. I search tons of library catalogs in my research, and I very rarely see examples where the whole data stream is encrypted. I'm going to skip over all of this. Let me just say that I did this study and I tabulated it. I think I'll publish the results in my next Smart Libraries Newsletter, so that'll be available through ALA TechSource soon, and it'll be available after the embargo period on my website. The staff functionality deals with even more sensitive information, more sensitive to the library as well: our financial information, accounts, and all those kinds of things. We certainly want those to be carefully encrypted from a business continuity point of view, with business details and money and accounts and all of that. So is that handled by the application automatically? The state of practice is very uneven. Some say, well, if you're worried about that, send it through a VPN. In general that's not the best answer. You can do that, and it will work, but most of us who use VPNs know that it's kind of a hard thing to do, and it's tricky for remote access. The other question I asked was how things are stored internally. In other words, if someone does compromise your system, is it going to be easily available information, or is it encrypted? Again, it's very spotty, and
I would say it's more than spotty. There's mostly a practice of non-encryption, except for passwords and things like Social Security numbers. But general patron details, or address and all of that, even if it's transmitted on the wire with encryption, is probably stored on disk in the clear. So again, I think there's cause for concern. All of that data also comes in and out through systems on the back end, through application programming interfaces and library protocols like NCIP and SIP, which are sometimes secured and sometimes not. So I think we have to look at the leakage that can happen on the back end, which often is still transmitted over a network, often insecurely. One question I floated is: do we need to have some kind of compliance framework, informal or formal? I'm too early in my thinking to come to any conclusions, but in the same way that we know we wouldn't transmit credit card information over the wire without a certain agreed-upon set of controls being in place, what are the controls that we need to have as we transmit patron data of any kind, including session data, over the wire? So we need to educate ourselves about what the issues are, and I hope that's part of what this panel is, and then start thinking about what the methodology would be for defining those questions and a certain compliance or not. I already said that I will make this information available, but the meat of the question is: what are we going to do about this? I think it's really critical that if we take the library ethic about patron privacy seriously, then we really have to adjust the way that we treat patron data on our networks in order to ensure more security. And the basic step is just end-to-end encryption, to have comprehensive SSL for all of the transactions that libraries conduct between their users and the public. So I would like to see more awareness, and there's a lot of ability to do things about this within the configuration of the products we've talked
about, and in the vendors offering this as the default delivery of these systems. We need to know to ask for it, and I think that's part of what we want to do. And I would say that there's no time like the present. I think that if there were an insistence from the library community, this is something that the vendors could deliver in pretty short order. In the same way that Google and Facebook changed in 2013 to comprehensive encryption of the delivery of their services, wouldn't it be reasonable for libraries to make that a goal? Let's do this by next year. There's nothing technically hard about that. There's the computing power to do it. It's just the knowledge and the awareness and the urgency to get that done. So that, I think, is my ending point. And the other thing is that no matter how secure these systems are, we keep inserting insecure things into them. You can take a perfectly secure library catalog and then put AddThis and other kinds of widgets in it that then corrupt that. So it's not just a matter of relying on the vendors and the online catalog; libraries need to do comprehensive audits of the systems that they assemble out of all these pieces and parts, in order to understand the leakage, the exposure, the security and the privacy. So I think that's my time, and thank you.

Okay, with that we are over, so thank you very much for your attention. We really appreciate it.