Hello, let me introduce myself. I am David Lacey, director of library technology at Temple University in Philadelphia, presenting today with Cody Hansen from the University of Minnesota. The title of our talk is "Collecting, Correlating, Stitching, Enriching: How Commercial Publishers Are Creating Value by Profiling Users."

Part one: the patron data firehose, or more specifically, our institutional single sign-on systems. I mentioned in the abstract some specific technologies that facilitate this practice. Because of recent events, I was forced to embellish my talk and cut some things out, and this is going to be one of them. But if you're not familiar with how this practice works: single sign-on essentially acts as a bridge between an identity provider and a service provider to facilitate authentication and authorization in a manner that does not involve passing credentials around. Authentication is governed by a notion of trust. You establish a relationship between the identity provider and the service provider, and you agree to trust each other. It's further simplified by trust federations: instead of setting up this trust on a case-by-case basis, you join a federation and everyone agrees to trust everybody. So that's my 30-second review of how that works.

Authorization, though, is the part of the process that governs what an individual is allowed to do once they are on a service platform. There's a metadata profile that accompanies this transaction, containing data that defines who the patron is and what they are allowed to do. It's in the nuances of authorization where things start to get weird.
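To make that metadata profile concrete, here is a minimal sketch in Python of how an identity provider might decide which attributes to release to which service. The patron record, the service identifiers, and the release policy are all hypothetical; the attribute names are borrowed from common directory and eduPerson usage purely for illustration, and real identity provider software expresses such policies in its own configuration format rather than in code like this.

```python
# Hypothetical attribute profile the identity provider holds for one patron.
PATRON_PROFILE = {
    "givenName": "Pat",
    "sn": "Example",
    "mail": "pat@example.edu",
    "eduPersonScopedAffiliation": "student@example.edu",
    "eduPersonEntitlement": "urn:mace:dir:entitlement:common-lib-terms",
}

# Hypothetical release policy: which attributes each service provider may receive.
RELEASE_POLICY = {
    # A publisher that only needs to know the patron is an authorized user.
    "https://sp.publisher.example/shibboleth": {"eduPersonEntitlement"},
    # A service that supports named accounts and direct user support.
    "https://sp.office-suite.example/saml": {"givenName", "sn", "mail"},
}


def release_attributes(profile, service_id, policy=RELEASE_POLICY):
    """Return only the attributes the policy allows for this service.

    Unknown services get nothing, so release is opt-in per service.
    """
    allowed = policy.get(service_id, set())
    return {name: value for name, value in profile.items() if name in allowed}


if __name__ == "__main__":
    for service in RELEASE_POLICY:
        print(service, release_attributes(PATRON_PROFILE, service))
```

The design choice worth noticing is the default: a service that is not listed receives an empty profile, so every name or email address that leaves the institution has to be a deliberate decision.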
This metadata profile often contains names, which are useful to consuming applications for personalization, and your email address, which is oftentimes used as an identifier. At least within my institution, according to our identity management director, having the email is quite useful. The example he gave me was our corporate license for Microsoft: if a particular patron is having trouble installing the application in the cloud, they can work with the individual directly without getting corporate IT involved. The metadata could also include your institutional affiliation: whether you're a student, faculty, staff, alumni, et cetera.

This whole notion of using single sign-on technologies in the context of online and commercial publishing was introduced to me at an RA21 meeting about a year ago. RA21 is a NISO-led initiative to improve the user experience of patrons accessing online resources. The group is largely composed of STM publishers. There are some exceptions, but it's mostly publishers, and it's led by NISO. More information on this group and who they are is in the links at the end of the talk. One of the recommendations within this program is to leverage single sign-on technology to access these resources in place of the existing practice of IP-based authorization, which goes through our EZproxy systems.

My immediate takeaway, and not even a takeaway, because it became clear to me within 10 minutes of this meeting, was that if we're going to go down this road of implementing this level of trust with commercial vendors, it's going to require a substantial amount of discussion and planning to define enforceable policies. But since this initial meeting they've completed a pilot.
It was completed last fall (again, there'll be a link at the end of the talk), and in it they tested their new UI and established an SSO integration with Elsevier. In the pilot the vendor, Elsevier, negotiated the transfer of patron information in the form of first name, last name, and email. But they also indicated that other fields would soon be required for departmental billing, for differentiating between employee types, and for granular usage reporting. So it's more than just people's names. And lo and behold, it all worked. It was a great success, but of course it was, because these things work. We use them all the time.

Privacy is the obvious concern; we all see that. So let's talk about money, because privacy is a concern and money is also a concern, and data is money. Data is this gold standard, right? User profiles and usage patterns are responsible for explosive financial growth within technology companies. I'm not going to name names; we all know who they are. How much money? This is the hard question: what is this data worth? Analysis of existing research states that data is not worth much at all, at least in isolation, in small bits. But its value can increase through a variety of factors. One is the nature of the data: what kind of data are we talking about? Data involving health care
at the personal level is extremely valuable, which is frightening. And when the data is aggregated and enriched with other bits of data, its value goes up. Think of our use case: having somebody's name and email address isn't necessarily a big deal, but when you combine names and email addresses with browsing history, it gets a little more complicated. Cody will go into other examples later that get a lot more complicated. Ultimately it's impossible to determine this number; a lot of what governs the value of this data is protected by trade secrets. But interestingly, the data's value can be inferred from how valuable the vendor perceives it to be.

Historically, in order to get your hands on delicious data, you have to offer something in return: free or discounted access to online services (think airport Wi-Fi or Gmail), free or discounted access to online content (think Spotify and Netflix), and free or discounted offline services. That last one is really popular with insurance companies: health trackers. Has anyone ever plugged a little box into their car that tested how good of a driver
they are? I did that. Whatever; it was complicated.

So, recent events. I cut some things short because a lot has happened in the past two months that we really need to talk about, because it relates specifically to this. First of all, NISO's evolving narrative. My early criticisms of RA21 were, I think, misapplied, and I've definitely softened. RA21 is just a group trying to figure out a way to do this. From the NISO perspective especially, they're in a difficult position because they're playing both sides. They understand that there are privacy concerns the community is really passionate about, but at the same time most of the people involved in the pilot and the project are corporate publishers who really want all this stuff. But recently they've established a code of conduct, GÉANT, I believe it's pronounced. What GÉANT essentially establishes is that if there is going to be an exchange of patron metadata between an institution and a publisher, it has to be contractually agreed upon, and any other data that may accompany that SSO transaction has to be discarded. So that's pretty nice.

Recent development number two: Safari Books Online. We recently negotiated our contract, where we had two options. Option one: use the fancy new system that allows you to do offline reading. In exchange for using that platform, you have to use your SSO; no other way about it. Or we could continue to use the janky old platform that uses IP-based authorization.
So this is a great business model: in exchange for having a better platform, we had to give them more data than they probably need. We were unprepared to handle this negotiation.

Recent event number three: Stanford's recent privacy statement and the subsequent Scholarly Kitchen article, pretty much stating, yeah, we're not doing any of this. Using SSO for online databases is something they're not even going to entertain.

Fourth on the list, and I think they're presenting now: the SPARC group recently released their landscape analysis, which details the evolution of commercial publishers from being publishers to being data analytics firms. It's this last one that has, not really shifted my thinking, but given me more things to think about, because now we're talking about a tale of two businesses. When this whole thing landed, I was thinking about what the outcome would be when commercial publishers can embellish their existing product line with new data. In my darkest hours of paranoia, I was thinking about the impact it could have on tiered pricing models. Everything now is strictly anonymous. How will pricing models change when they know it's a student or a faculty member? But now, given the advances made within Elsevier and people doing massive analysis on the corpus they maintain, what impact is this new data going to have from the perspective of a business analytics firm?
We mentioned earlier that the value of data tends to increase when you can associate it with other data, and now we have a lot more data.

So, just to recap. All of the technologies that implement this kind of communication are complicated, and they vary from institution to institution. I highly recommend everyone educate themselves. I've been spending a lot of time with our director of identity management, and I wish I had some fun, uplifting things to report.

Implementing these SSO technologies with commercial publishers raises privacy concerns, pretty obvious ones. We hold browsing history to be fairly sacred, and now we're looking at a landscape where other people will have to maintain that same sacred trust.

Implementing SSO technologies with publishers-slash-data-analytics businesses represents a new line of revenue, and one that avoids established practices regarding its value. They're circumventing the common convention that there's a trade-off in gathering this kind of data. They're not offering a cheap alternative platform to consume published goods; it's a gift-wrapped bonus. In a lot of ways that's convenient, because it lets us avoid a very uncomfortable conversation about the fact that this data has value, and about whether we hold or create leverage on that value. God forbid we think about selling it, but it is valuable, and the assumption that we'll just give it away avoids that conversation.

Ultimately, all of these concerns fall squarely on the backs of the identity providers, a.k.a. the institutions: the people who choose to pass the information through to vendors. That relates back to point number one, education, and figuring out what we're currently giving people as things are currently configured.

Here are some helpful links: Stanford's statement on patron privacy, the GÉANT announcement, the RA21 pilot, and the SPARC landscape analysis. Thank you.
I hope this has been helpful. We're going to try to have time for questions at the end. Thank you.

All right, thanks, David. Thank you all for coming. I'm really happy to be here to talk a little bit about some work I've been doing over the past few months that came directly out of some talks I saw at the December CNI meeting and out of subsequent conversations with David, among others. I'm going to talk about user tracking on academic publisher platforms.

Before I go too far, I want to acknowledge a lot of colleagues who have helped sharpen my work over the past couple of months, either through conversations, through editing, or through their own work in similar areas. I also have a lot that I'm going to try to get through in a short time, and I do want to have time for questions. So if you want to read a lot more about what I'm talking about, I've got something online here. I'll put this link at the end of the talk as well, along with the slides. This talk is being recorded, so it will be available on the CNI site, and I'll put it on my site too.

I could go into great detail, and left to my own devices I would. So, lest we run out of time, I want to give you the key findings up front. I apologize to those of you in the back; the type is small here.
So I'm going to give you a short rundown. I found that the articles most frequently used by patrons at the University of Minnesota include code on their publisher pages that is designed to identify users and to link their identity to the pages they visit. I found that these tools derive user identity in part through metadata that is not currently governed by our typical definition of personally identifiable information. And the conclusion I have come to is that I do not believe it is currently possible to ensure that use of electronic library resources can be private. I'll talk you through that now.

A little bit of background. I mentioned the December CNI meeting. Three talks there were really impactful for me in this work. The first was a talk from Kenning Arlitsch and Scott Young, based on an article they had written with some colleagues, where they did some automated analysis of library homepage source code, looking for the presence and proper implementation of privacy protection measures. The second was a talk by Micah Altman, Lisa Janicke Hinchliffe, and Katie Zimmerman, where they did a very detailed analysis of publisher platform terms of service and also looked at some web tracking mechanisms. The third was a talk by Todd Carpenter from NISO, Jean Shipman of Elsevier, and Ralph Youngen from ACS about RA21.

There was a moment in that talk that really crystallized this project for me. In response to questions from a bunch of dudes about privacy and RA21, Todd Carpenter, in an attempt to reassure us that RA21 was not in fact a grab for personally identifiable information, said (and this is a paraphrase): publishers don't need RA21 to identify users. This was intended to allay our concerns.
I left the room concerned.

So here's what I did when I got home to Minneapolis. In January and February of this year I embarked on a very simple study, trying to answer the question: can an analysis of the source code of publisher platform pages, much like what the folks at Montana State did, provide evidence of if and how publishers can identify library users? In other words, I wanted to test whether Todd Carpenter's statement was in fact true. The answer is yes.

Here's how I went about it. I looked at the hundred most frequently accessed articles at the University of Minnesota. We record DOIs that pass through our EZproxy server, and have for a couple of years, so I took the hundred DOIs that appeared most frequently in our EZproxy logs over that period. Those hundred articles came from 15 different publisher platforms. As an aside, I think the fact that only 15 platforms were represented in the hundred most frequently used articles at our library is its own problem, probably worth further conversation, but I don't have time today.

For each of those 15 platforms I took one representative article, the most frequently accessed article from that platform in our logs. I resolved the DOI through doi.org from an on-campus IP that's part of our authentication range with each of those publishers. I captured a complete archive of the page, including the first- and third-party assets and all the code and scripts that come along with it. I read the source code to the best of my ability. I will note that one platform I looked at at random shipped over 60,000 lines of JavaScript to the browser, so that's why I say "to the best of my ability." And then I analyzed the live page with Ghostery. You may have heard of Ghostery before; it's a very handy browser extension.
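Part of that source-code review can be approximated with a few lines of code. What follows is a rough sketch in Python, not the procedure used in the study (which relied on reading the captured source and on Ghostery): it parses one saved HTML page and lists the hosts, other than the page's own, from which scripts, images, frames, and linked resources are requested. The sample page, its URL, and the host names are made up, and the same-site check is a deliberately naive assumption (a host counts as first-party only if it equals the page host or is a subdomain of it).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class AssetCollector(HTMLParser):
    """Collect the URLs of assets referenced by a static HTML page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in ("script", "img", "iframe") and attributes.get("src"):
            # urljoin resolves relative and protocol-relative references.
            self.urls.append(urljoin(self.page_url, attributes["src"]))
        elif tag == "link" and attributes.get("href"):
            self.urls.append(urljoin(self.page_url, attributes["href"]))


def third_party_hosts(page_url, html):
    """Return the sorted hosts of referenced assets not served by the page's site."""
    page_host = urlparse(page_url).hostname or ""
    collector = AssetCollector(page_url)
    collector.feed(html)
    hosts = set()
    for url in collector.urls:
        host = urlparse(url).hostname
        if not host:
            continue
        # Naive first-party test: same host or a subdomain of the page host.
        if host == page_host or host.endswith("." + page_host):
            continue
        hosts.add(host)
    return sorted(hosts)


# Made-up article page for illustration.
SAMPLE_URL = "https://journals.example.org/doi/10.1000/xyz"
SAMPLE_HTML = """
<html><head>
<script src="/static/app.js"></script>
<script src="https://tracker.example.com/t.js"></script>
<script src="//cdn.example.net/lib.js"></script>
</head><body>
<img src="https://static.journals.example.org/logo.png">
<img src="https://pixel.example.com/p.gif">
</body></html>
"""

if __name__ == "__main__":
    print(third_party_hosts(SAMPLE_URL, SAMPLE_HTML))
```

A static scan like this undercounts, because it cannot see anything that a script requests after the page loads. A browser extension such as Ghostery reports on the live page, which is why it was used for the actual analysis.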
There are others like it, but what it does is allow you to block known web trackers. Here's a screenshot of Ghostery running on the website that my team maintains, just to show you that my hands aren't entirely clean. Ghostery finds these third-party assets and blocks them if you want it to. I set Ghostery not to block anything, but instead just used its user-friendly display of the third-party code on the page. I want to emphasize how simple this research was; this is well within the grasp of any library staff member.

Here's what I found. On average, each of the 15 platforms had 18 third-party assets being loaded on their article page. The median was 10. There was one that had none. If there's anyone here from INFORMS PubsOnline, kudos on having no trackers on your platform. One had over a hundred. And I found a total of 139 distinct third-party asset sources across these 15 platforms.

What is the significance of third-party code? Why do I care about it? Why am I looking for it? JavaScript that is loaded onto a web page can access the following things: the page address, the page contents, user actions on the page, browser info, the user's IP address, and the contents of existing browser cookies (I'll get into that). JavaScript can also load additional JavaScript from other sources.

When you're talking about the page address, the page contents, and user actions on the page in the context of a scholarly article, this reads to me, in ALA patron bill of rights parlance, as information being sought. This is one half of what we try very hard to protect: user behavior, user interests, user research. We try to protect that when it is combined with user identity information. So under our fairly common understanding (at least it's true at my institution) of what constitutes personally identifiable information, this isn't a big deal.
We don't consider IP addresses to be personally identifiable. I think there's an argument for reconsidering that. But by loading third-party JavaScript, publisher platforms are effectively sharing the content of user research inquiries with third parties, along with information that can, and I would say will, be used to specifically identify the user. Bringing those two things together makes this something that we typically would try to protect.

So how does this work? Let's take the example of Facebook. Four of the 15 platforms included Facebook code on their page. On sites with Facebook code, we can assume that when users with a live Facebook cookie in their browser (that means you used the "remember me on this computer" or "save my login" function) visit a publisher page that has Facebook code loaded on it, their visit to that page is going to be stored and attributed to their Facebook identity.

You may recall, a couple of months ago in the news, Mark Zuckerberg testifying on Capitol Hill, where there were questions about shadow profiling as a practice at Facebook: creating profiles for people who do not have Facebook accounts, based on information from other sources. Because of some of the information that came out around that hearing, we can assume that on sites with Facebook code, for users without a Facebook cookie in their browser, the information about the page they are visiting is likely being combined with a shadow profile, or being used to create one, behind the scenes.

Google: 14 of the 15 publisher platforms included Google code. Likewise here.
We can assume that on sites with Google code, if you have a live Google cookie in your browser, your identity is going to be combined with information about the page you're visiting and stored by Google. I'm trying very hard to keep this as factual as possible and to point out when I'm editorializing or making assumptions. Here's an assumption I'm making: I assume that the same holds true for users without a Google cookie.

How does that happen? How does a shadow profile get created, and how does information about your identity get stored even when you don't have an account or a live login with one of these third parties? One way is through browser fingerprinting. This may be a technique you're familiar with, but if not, it's a way to generate a unique identifier for a user when you don't know their login information or they don't have an account with you. It takes metadata from your web browser that is sent by default to the web server and effectively creates a hash of it. If you work in digital preservation, you may know about cryptographic hashes as a way to uniquely identify digital items, and that's what's going on here. It takes what looks like very benign information, and when that information is combined it becomes remarkably identifying.

As an example, this is a screenshot of my visit, with the browser I use most frequently, to the EFF Panopticlick site, where it showed that of the visitors to their page in the past 45 days, my browser matched only one in over 100,000 browsers. My browser is fairly unique, and I'm not doing anything too interesting to make it so. But (I'm going to go back here) I will point out what happens if you do things like enable Do Not Track or install privacy-protecting plugins in your browser.
It just makes your browser more unique and makes you easier to track, unfortunately.

Browser fingerprinting and this kind of shadow profiling are not just the province of major social networks and ad networks. There's a class of tools I'll refer to as audience tools. You may have heard them referred to as data management platforms or digital marketing platforms. In fact, the title of our session comes from a promotional video for one of these tools. They talk about collecting, correlating, stitching, enriching: it's about how they combine these tiny bits of data and metadata with other data sources with the express purpose of deriving user identity.

I don't expect you to read this; it's just an illustration. Here's a company called Neustar, and a couple of things from this page. In today's connected world, where consumers move rapidly across devices and touchpoints, it's time to stop guessing and start knowing with accurate and verified customer identity data. Over 150 million households compiled, verified, and enhanced with 450-plus fields of demographic, behavioral, financial, property, segmentation, and geographic assets. At least four of the 15 publisher platform pages included Neustar code. Neustar claims that the user profiles in their OneID system are re-corroborated every 15 minutes and that they collect 11 billion points of data every day. So that's one of these audience tools or data management platforms.

This is a screenshot from a marketing video for another tool, Adobe Audience Manager. You'll note that it shows their demographics screen with age and income level, and spaces for gender, purchases, and social. At least six of the 15 publisher platforms included Adobe Audience Manager code. Adobe claims that Audience Manager can turn fragmented data from any channel or device into meaningful audiences that you can act on right away. It can be used to deliver offers only to
users when they are logged in or based on previous login activity. So when someone is not logged into your platform, you still know who they are. And they advertise their ability to enrich the data you collect with data purchased from other brokers such as Acxiom, which has comprehensive consumer data on approximately 250 million US addressable consumers. That's pretty much everybody.

The third of these audience tools or digital marketing platforms that I'll mention is Oracle Marketing Cloud. Four of the 15 publisher platforms included Oracle Marketing Cloud code. Like the others, they are very proud of their ability to connect a user across devices and across sessions. They claim that their Oracle ID Graph can reach over 90% of the people online in the US.

Where do these data management platforms get the information they use to build these data sets: the metadata, the browser fingerprints, things like that? Well, at least some of it comes from our patrons' use of library resources. Eleven of the 15 publisher platforms included a tool called AddThis. AddThis is a script that gathers information about the user and their activity and shares it with a network of over 40 different advertisers and data brokers, including Neustar, Adobe, Oracle, and Google. So publisher platforms send data to these data brokers, who then use it to help publishers and ad networks better identify and target users on publisher platforms.
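To make the browser-fingerprinting idea from earlier concrete, here is a minimal sketch in Python of hashing a handful of ordinary browser attributes into a single identifier. The attribute names and values are hypothetical, and this is only an illustration of the hashing principle; it is not the method used by Panopticlick or by any of the vendors named in this talk, whose techniques are not described here.

```python
import hashlib

# Hypothetical values of the kind a browser reveals to the pages it visits.
EXAMPLE_BROWSER = {
    "user_agent": "ExampleBrowser/1.0 (Example OS 10.14)",
    "accept_language": "en-US,en;q=0.9",
    "screen": "1440x900x24",
    "timezone": "America/Chicago",
    "fonts": "Arial,Georgia,Helvetica,Times New Roman",
    "do_not_track": "unset",
}


def fingerprint(attributes):
    """Hash browser attributes into one hexadecimal identifier.

    Keys are sorted first so the same attributes always produce the same
    digest, whatever order they were collected in.
    """
    canonical = "\n".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(fingerprint(EXAMPLE_BROWSER))
    # Changing one setting, such as enabling Do Not Track, yields a different
    # identifier, and an uncommon setting makes the combination rarer.
    print(fingerprint({**EXAMPLE_BROWSER, "do_not_track": "1"}))
```

No single value here identifies anyone. The identifying power comes from the combination, which is why each additional attribute, including a privacy setting that few people enable, tends to make the result more distinctive.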
I've now mentioned six of the 139 different sources of third-party code that I found on these 15 platforms. Any of these 139 tools is technically capable of similarly surveilling users, and we have to assume that many are.

This is the complete list of the AddThis partners as of February, highlighting the ones I've previously talked about. I'll note that as of yesterday, Neustar's site featured a story about Aetna's successful use of their technology. And I'll note that the top hundred articles I started with at the beginning of this study included topics like childhood obesity and cancer treatment. I don't expect that our users anticipate that their research on health topics will ultimately be used to create a profile on them that will be shared with their insurance company. Likewise, do our users expect their research behavior to be shared with eBay and combined with their bidding activity? Samsung is a partner here. Do they expect their research behavior to be shared with the manufacturer of their television for the purpose of better showing them ads on the television home screen? There's at least one publisher platform that directly included Samsung advertising code on their page. At least one platform included code from LinkedIn. Do our users expect that their research behavior is going to be used to help
target advertisements to them on their career networking site?

So I do not believe that it is possible for use of licensed resources to be private. The tiny bits of information that are being sent to dozens of third-party platforms every time our users access an article will be used to identify them. Our idea of personally identifiable information has been totally outstripped by Moore's law and cheap storage, so that now we have to assume that every tiny bit of information that can be collected about a user will be collected, and will sit latent until enough information can be aggregated around it that the user can be personally identified.

I am really heartened to see some recent institutional attention being given to this new privacy landscape. ALA patron Bill of Rights Article 7 was approved at Midwinter. This reads as pretty aspirational to me given what I've looked at here, because I think it's fairly safe to say that we are not presently protecting privacy and safeguarding library use data.

Likewise the Stanford statement that David referenced earlier. This is an excellent statement; it's powerful. What I don't know is whether it's true, at least in the present-tense sense of "reject." Because unless the code that's being shipped to users from the signatories' libraries is substantially different from what's being shipped to University of Minnesota Libraries patrons (and it's possible it could be), they are in fact silently exposing user data to third-party interests.
I suspect this statement was intended to apply primarily to things like Safari Books Online and single sign-on, but it's broadly stated. I'm going to go back here for just a second and say that I am concerned that when we tout our commitment to privacy and our values around privacy, we not give our users a false sense of what the actual privacy landscape is, and I believe we are in that position currently.

There is some effort underway to build model license language around some of these concerns. Again, Lisa Janicke Hinchliffe and Katie Zimmerman talked a little at the December meeting about a nascent effort there that I'm really excited about.

Finally, I would encourage all of you here, and all of you listening at home in the future on the video: this is very easy to do, and I think it will reveal a lot about the current landscape of your electronic resources. I would encourage you to take a look at it yourself. With that, here's a link to the longer write-up and my contact information. I will be around the rest of today and tomorrow, happy to talk about this with anyone. And we've got time for questions.