Thank you. My name is Sarah Bird. You can find me on the internet at birdsarah. I was just saying, the more I talk, the more nervous I seem to get, so I am melting on the inside. If I make mistakes, please forgive me, and please feel free to shout out if you see something insane. I'm a research and experiments engineer at Mozilla. It's a position I joined just a few months ago, at the beginning of March, and my life was a happier, more innocent life before I joined Mozilla. As mentioned, I'm also a core maintainer of Bokeh. I must confess I haven't contributed any code in a while, and I hope to get back to being more active in the community, so I don't want to wear that label too loudly. But I am a huge fan of Bokeh, and you will see a bunch of Bokeh in this presentation today. It is, objectively speaking, the world's best plotting library. All of the slides are up on my GitHub. A few quick public service announcements. Mozilla: we make Firefox. Most people know that. And yay, Firefox! And we make lots of other cool things too. Since I joined and spoke to my friends about joining Mozilla, I have been surprised that a lot of people don't know that we are a non-profit, so I just want to say it again. Hopefully this crowd knows that we are a non-profit, and our mission is a healthy internet. My other public service announcement is for the amazing organisation NumFOCUS, which incubates all of those projects on the left, is affiliated with the ones on the right, and helps with the PyData conferences. There are two PyData tracks here, and it's an amazing part of the Python ecosystem now, so I just want to give credit to the amazing organisation and work of NumFOCUS. Today we're just going to see a small fraction of that universe. We're going to be putting to use some Jupyter: everything that you're seeing running is running in Jupyter notebooks. Pandas, Bokeh, the world's best data visualisation library,
Dask, scikit-learn, a little bit of PySpark, and NetworkX in the background. And a shout-out to Conda and NumPy, because you won't see either of them, but they are the core technologies underlying this whole ecosystem and we would be nowhere without them, so I always like to give them a little bit of recognition. So, right. The web is terrifying. There are unseeing eyes all over. Oh, I just lost a... There we go. There are eyes all over the internet watching us whenever... The monitor here is blinking on and off, if somebody knows how to make that not happen. The unseeing eyes that are following us all around the internet. So, how does that happen? How are we followed around? How many people here have heard of a cookie? Good. It is, as we all know, a small piece of data that a server sends to a user's web browser. The browser may store it and send it back. And it's a very simple piece of data: a key and a value. They say that naming things is hard. I like the name Temp Persistent User ID. And it's affiliated with a domain, in this case the Harwick's website. So, let's take a look at... Oh, my goodness. It's going to get tricky. Everything is flashing on and off and on and off. Okay. So, first, you need to find your Firefox profile directory. I've included all of the code for this in the GitHub repo, but I have not included my Firefox profile directory, because it includes an incredible amount of personal and sensitive information. So, find your own and replace it. Once you know your profile directory, you can use glorious Python to list out the contents of that directory and see a bunch of SQLite databases in there. I'm going to just look at this screen for now. Apologies for not facing you, but it's chaos over here. Great. So, next, we can use Pandas and Python's built-in SQLite support to open the SQLite database and have a look at what tables are in that database.
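A rough sketch of that step follows. It is not the code from the talk's repo: the helper names are made up for illustration, and the stand-in profile it builds only mimics the `cookies.sqlite` file and `moz_cookies` table that Firefox is assumed to use here. Point `profile_dir` at your own profile to try it for real.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path

import pandas as pd


def list_sqlite_files(profile_dir):
    """Return the names of the SQLite databases in a profile directory."""
    return sorted(p.name for p in Path(profile_dir).glob("*.sqlite"))


def list_tables(db_path):
    """Return the table names stored in one SQLite database."""
    with closing(sqlite3.connect(str(db_path))) as conn:
        tables = pd.read_sql_query(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name",
            conn,
        )
    return tables["name"].tolist()


# Stand-in profile so the sketch runs anywhere (assumption: file and table
# names mirror what a real Firefox profile contains).
profile_dir = Path(tempfile.mkdtemp())
with closing(sqlite3.connect(str(profile_dir / "cookies.sqlite"))) as conn:
    conn.execute("CREATE TABLE moz_cookies (host TEXT, name TEXT, value TEXT)")
    conn.commit()

sqlite_files = list_sqlite_files(profile_dir)
tables = list_tables(profile_dir / "cookies.sqlite")
print(sqlite_files, tables)
```

Querying `sqlite_master` is the standard SQLite way to discover table names, so the same helper should work on any of the other databases in the profile.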
And then now we know the name of that table. We can use Pandas and just read straight in from that table. And we're just going to take out for now just the domains that have set cookies and the timestamps when they were made. So, this was a real profile that I was running on my work laptop, and I've just cut it off for the sake of this talk for three months, from March 1, shortly after I joined Mozilla, to the beginning of June. Does anybody want to shout out a guess at how many cookies were lurking around after three months? It is definitely somewhere between 1,000 and 3 million. The answer was 3,000. And let's take a look at this with the World's Best Plotting Library, Bokeh. I'm not going to show a lot of Bokeh code, but this is what a typical piece of Bokeh code would look like. You set up a figure, you set up your data source, you can pass in a whole Pandas data frame if you want, and then you start drawing things, patches and circles in this case and configure it, and you get out something like this. And these are my cookies over that three-month period. Now, does anybody want to hazard a guess what happened around May 1? Say that again? I don't think any of the above. I decided to sacrifice my privacy at the altar of giving a talk and turn back on third-party cookies. So the first thing I'd done when I got my laptop was turn them off, and I thought, well, I'm giving a talk about tracking technologies. How bad can it be? Turns out, terrible! So one of the other reasons that I kept this to a three-month window is because once I put this talk together, I was like, this is horrible, and I started a fresh profile, so I'm no longer using this profile on my machine. So we can use some straightforward Pandas to just have a look at the domains that were setting cookies. 60 cookies over three months is like one every one and a half days, give or take. 
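As a hedged illustration of the read-and-count step, the sketch below builds a tiny in-memory table and counts cookies per domain inside a date window. The column names `baseDomain` and `creationTime`, and the assumption that timestamps are microseconds since the Unix epoch, reflect the Firefox schema as assumed for this talk's era and may differ in other versions. The domains and dates are invented.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def to_microseconds(date_string):
    """Convert a date string to microseconds since the Unix epoch."""
    return int(pd.Timestamp(date_string).value // 1000)  # .value is in ns


def load_cookie_times(conn):
    """Read the domain and creation time of every cookie into a DataFrame."""
    df = pd.read_sql_query(
        "SELECT baseDomain, creationTime FROM moz_cookies", conn
    )
    df["created"] = pd.to_datetime(df["creationTime"], unit="us")
    return df


# Invented rows standing in for a real cookies.sqlite.
rows = [
    ("ads.example", to_microseconds("2018-05-02")),
    ("ads.example", to_microseconds("2018-05-20")),
    ("news.example", to_microseconds("2018-03-15")),
    ("news.example", to_microseconds("2018-07-01")),  # outside the window
]

with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute(
        "CREATE TABLE moz_cookies (baseDomain TEXT, creationTime INTEGER)"
    )
    conn.executemany("INSERT INTO moz_cookies VALUES (?, ?)", rows)
    cookies = load_cookie_times(conn)

# Keep the three-month window used in the talk: March 1 to the start of June.
in_window = (cookies["created"] >= pd.Timestamp("2018-03-01")) & (
    cookies["created"] < pd.Timestamp("2018-06-01")
)
window = cookies[in_window]
counts = window["baseDomain"].value_counts()
print(counts)
```

The resulting `window` frame is also the kind of DataFrame that could be handed to a Bokeh data source for the timeline plot.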
And although I look like the kind of person that might go to insightexpressai.com and stickyadstv.com, I haven't actually been to those websites, and I don't know what they are. There are lots of great ad-blocking companies and open-source software out there. Disconnect.me provides a list that categorises domains for us into some useful buckets. So we can merge that on, and we can see that most of those high-count ones that I don't recognise are, unsurprisingly, advertising domains. And we can sum them all up, and we can see that about 20% of the cookies that were set, probably mostly in that last month, were advertising and analytics cookies. A lot of those were not categorised by the Disconnect.me list, and we can think about that a bit later. But there are lots of advertising companies out there, so maybe it's not too bad. If we explode that group-by a little bit, we can actually see that within advertising there are a lot of different ad companies that have set one-off cookies. So maybe that's not that bad: there are just all these different players out there trying to make a buck in the world, and maybe it's all okay. I'd like to introduce you to two little scamps that I call cookie syncing and zombie cookie. Cookie syncing is sad because zombie cookie has a way cooler name. And together they get together and destroy our privacy. So let's look at cookie syncing first. This is the process by which different trackers, different players, link the IDs that they've given to the same user. Steven Englehardt wrote a really nice blog post about this back in 2014. Or you can just go and read Google's docs about it, because Google provides cookie syncing as a service, and it is really the backbone of the real-time bidding infrastructure that runs every time you get served ads online these days. But do we see this? Does this happen in practice on my machine as I'm going about my happy business?
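Stepping back to the category merge described a moment ago, here is a minimal sketch of the idea. The domains and category labels are invented for illustration and are not taken from the real Disconnect.me list, whose actual file format is not reproduced here.

```python
import pandas as pd

# Invented cookie domains, one row per cookie.
cookies = pd.DataFrame(
    {
        "baseDomain": [
            "adtracker.example",
            "adtracker.example",
            "news.example",
            "unknown.example",
        ]
    }
)

# Invented stand-in for a domain-to-category list.
categories = pd.DataFrame(
    {
        "domain": ["adtracker.example", "news.example"],
        "category": ["Advertising", "Content"],
    }
)

# A left merge keeps every cookie, even when its domain is not on the list.
merged = cookies.merge(
    categories, how="left", left_on="baseDomain", right_on="domain"
)
merged["category"] = merged["category"].fillna("Uncategorised")

# Fraction of cookies that fall into each category.
share = merged["category"].value_counts(normalize=True)
print(share)
```

Using a left merge rather than an inner merge matters here: the uncategorised domains stay visible instead of silently dropping out of the totals.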
So we're going to take a really crude approach to looking at this. We are going to start with our cookies database again. We have our domains, and this time I've pulled in the values that are actually in the cookies. I'm going to do something really crude: for every value in that list of 3,000 that I have set, I'm going to go and see if that value is in any other values. That's going to give me a filtered list, and then I'm going to aggregate by the domains, and I'm going to be super conservative and say that if I've got more than five domains appearing in here, I'm going to add it to a shared value list. And then, after a little bit of futzing to get it down, I decided to keep only the IDs that are more than length 10 and that are not common values. And that quite quickly, with very straightforward code, gave me a set of 25 potential IDs, and they definitely look like they could be unique identifiers. So here we can see what that looks like. You can see that that string, the 7620423, is in the values, but it's not the exact value. And so can we see cookie syncing? I'm a visual person, I'm a Bokeh developer. Let's put this together. So let's use NetworkX to build up a graph, and we don't need to worry too much about the details of this. Each domain is going to be a node, and we're going to have edges every time they've shared an ID. We've coloured the nodes by those Disconnect categories, and we've coloured the edges by ID. Some beautiful Bokeh code, which I will skip, but Bokeh plots graphs, and here we have it, and we're going to take a little minute to look at this. So we have this group over here, and these are all the Wikimedia sites, and so they've all shared some kind of ID, and they're all just off on their own there. They're in their own little group and that is not tracking.
That is a completely legitimate use of sharing an ID, and probably something that's quite useful and good internet behaviour, and we shouldn't be too worried about it. And then we have this hot mess. At the heart of these hot messes are things like PubMatic, which was, if you remember, high in my list of cookies that have been set. You can see all of the different colours coming out of there, which means that it is at the hub and it knows about dozens of other companies' identifiers. It means that when companies buy and sell data behind the scenes, they can then rejoin all of that data together, because they all know about each other's IDs, and you see it over and over again. You can also see, on the outsides of these networks, websites that I've gone to, like Wired.com; the LA Times is tucked in here somewhere. They're companies that are participating in and facilitating this ID sharing. But this is not a complete picture. That was a much smaller set of domains than are actually in my whole cookie table, and I used a very crude metric for getting there. I'm sure you can do better. Feel free to do better and let me know about it, because this is an interesting problem. But we can delete our cookies, right? Back to zombie cookie. Zombie cookies, or evercookies, are cookies that are recreated after deletion, and they do that by storing the information that's in a cookie in lots of other places. There are close to 20 other places that this information can be stored. But does this really happen?
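Before moving on to zombie cookies, here is a minimal sketch of the crude cookie-syncing search and graph described earlier. It is one possible reading of that description, not the code from the talk: the function name, thresholds and toy data are assumptions, and the talk's own threshold was more than five domains rather than the two used for this tiny example.

```python
from itertools import combinations

import networkx as nx

# Invented (domain, cookie value) pairs standing in for the cookies table.
cookies = [
    ("tracker-a.example", "abcdef1234567890"),
    ("tracker-b.example", "partner_uid=abcdef1234567890"),
    ("publisher.example", "sync:abcdef1234567890:1"),
    ("other.example", "zzzzzzzzzzzzzzzz-unique"),
    ("tracker-a.example", "en-GB"),
    ("publisher.example", "en-GB"),
]


def find_shared_values(cookies, min_domains, min_length=10):
    """Map each candidate ID to the set of domains whose values contain it."""
    # Only values longer than min_length are treated as candidate IDs.
    candidates = {value for _, value in cookies if len(value) > min_length}
    shared = {}
    for candidate in candidates:
        # Substring match: the ID may be embedded inside a longer value.
        domains = {domain for domain, value in cookies if candidate in value}
        if len(domains) >= min_domains:
            shared[candidate] = domains
    return shared


shared = find_shared_values(cookies, min_domains=2)

# One node per domain, one edge per pair of domains that share an ID.
graph = nx.Graph()
for shared_id, domains in shared.items():
    for left, right in combinations(sorted(domains), 2):
        graph.add_edge(left, right, shared_id=shared_id)

print(shared)
print(sorted(graph.edges(data="shared_id")))
```

The pairwise substring check is quadratic in the number of cookies, which is fine for a few thousand values but is one of the places where a smarter approach could do better.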
So we're back with our Firefox profile, but this time we're going to look at our local storage directory. In our local storage we read it in the same way, and very similarly we see key-value pairs associated with a domain, but it's now local storage. We have our 20-something IDs that we got from our cookie table, and now we're going to go and look for those, and we find that one of those IDs from the cookie table has been stored in our local storage. And so an idea occurs to me. Can I bring a cookie back from the dead? Can I raise a zombie cookie of my own? So I created a completely fresh profile, and all I did was copy across the local storage SQLite database. Nothing else; it was a completely clean profile. And then I went to the LA Times, which, I might add, I pay a subscription to, which makes it even more irksome. We're in our new profile, completely clean. We open it up, and the very first time you create a profile there are just eight cookies in there. Then we go to the LA Times, having copied across the local storage, and now what we want to see is: did that cookie value from our old profile appear in our new cookie database? And the answer is no.
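The local storage lookup can be sketched along the same lines. The table name `webappsstore2` and its `originKey`, `key` and `value` columns are assumptions about how Firefox stored local storage at the time, and every row below is invented; check your own `webappsstore.sqlite` before relying on them.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def find_ids_in_local_storage(conn, candidate_ids):
    """Return (id, originKey, key) for each stored value containing an ID."""
    storage = pd.read_sql_query(
        "SELECT originKey, key, value FROM webappsstore2", conn
    )
    hits = []
    for candidate in candidate_ids:
        # Plain substring match; missing values never match.
        mask = storage["value"].str.contains(candidate, regex=False, na=False)
        for row in storage[mask].itertuples(index=False):
            hits.append((candidate, row.originKey, row.key))
    return hits


# Invented rows standing in for a real local storage database.
rows = [
    ("publisher-origin", "tracker_uid", "abcdef1234567890"),
    ("publisher-origin", "theme", "dark"),
    ("other-origin", "session", "not-a-shared-id"),
]

with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute(
        "CREATE TABLE webappsstore2 (originKey TEXT, key TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO webappsstore2 VALUES (?, ?, ?)", rows)
    hits = find_ids_in_local_storage(conn, ["abcdef1234567890"])

print(hits)
```

Running the same search against the cookies table of a fresh profile, after copying the local storage database across, is the check described in the experiment above.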
Hooray, privacy not invaded! Although I must admit to being a little bit disappointed, because obviously I'm writing a talk about tracking technologies, and if it had worked it would have been cool. But then I'm suspicious and untrusting, and I thought, well, I haven't actually clicked on a story yet. So then I clicked on a story, and there it was. As soon as that Taboola content populated, the Taboola JavaScript pulled that value back out of local storage and put it back in my cookie table, which means that all of those other third-party cookies are then going to go ahead and keep doing that cookie syncing. Even though you've completely wiped your cookies, they have the means to completely link back all of your data. Well, that's depressing. Okay, sorry, let me see. Nope. Oh boy. Right, we're back. I wrote this in here because I kept getting lost. Okay, so, cookies. "A cookie is a small file placed on your device that enables features and functionality": that's sort of a paraphrase of your average corporate cookie policy. What they typically miss out is the second part of that sentence, which is "and enables us, and others we enable, to completely compile your entire browsing history". Do you care? I know I do. I feel uncomfortable with the level of sharing of the domains I've visited already. But then you start to think about adding that together with all of the stuff you do on your mobile phone, your location, your smart TV, your smart fridge, drones, who knows what; about when that data gets stolen; or about when you start to be discriminated against, or paying more for services, because of this data being known about you. And your browsing history can be, and is, connected to your real-world identity. Has anybody here never used social media?
Daniel, my hero! I don't know if Daniel is lying or not, but Richard Stallman certainly hasn't. When we do that, we create a very straightforward way to link our online behaviours and our offline behaviours. But even if we didn't, there is a whole industry called identity resolution, which, back in those happy days of January 2018, I had never heard of. Acxiom, and I'm not picking on them, they're just one of many companies that do this, does about a billion dollars a year in revenue. They employ what they call privacy-compliant matching, which I read as legally compliant privacy invasion, where they take your digital identity, your online identity, and they map it across all of the different devices you're on, and then they connect that up with your offline identity. And then they proudly sell to marketers an addressable base of two and a half billion people. I don't know how much of that is true, but they're certainly worth a lot of money. So fine: you'll turn off cookies, you'll type a password every time you go into a website, you're never going to use cookies again. But it's really your browsing history that is giving up your identity. So is anything else giving up our browsing history? The language we all love to hate: JavaScript. There are also lots of other things, web beacons, your ISP, but we do not have time to go into that today. So I'm going to tell you a little bit now about a project that my team at Mozilla, the systems research group, has been doing. We have Martin and Dave on the left, if you can't tell them by their photos. I used to think they were paranoid, and now I realise that they are the smart ones of my team, but it's very hard to find them online. In November 2017, before I joined the team, they ran a crawl, visited a million locations, and recorded 131 million JavaScript calls. So they went to a million websites, and they'd set up an instrumented web browser to record a whole series of JavaScript calls that are typically associated with
tracking behaviour. Oh, OK, I will go fast. You can read some of the results at hacks.mozilla.org, which is where we had some students look at this data and start digging into it. What this data looks like is not too complicated. For every line in this 131-million-line data set we have the location we visited, the location of the script that was loaded, some other things, and, of interest for today, the JavaScript API interface that we hit, so something like window.navigator.userAgent. I'm going to skip over my first aha moment in the interest of time, but the TL;DR is this: when you go to a website and you see it load a few little scripts, you don't think too much of it. But if you look at that from the other direction, and you play the server, and you see the breadth of websites that a single script server can see, you start to see the reach that they have into an ecosystem and into your traversal of the internet. When you combine that with behaviour like cookie syncing and these other things, you can see how people can build a really complete picture. I would totally recommend going to panopticlick.eff.org. They have been shouting about this being a problem since at least 2010, and you can see how unique your web browser is. We're just going to look at one example of the kind of fingerprinting that you can do with JavaScript, called canvas fingerprinting. There's a well-known open-source library called Fingerprintjs2 that's available online, and we're going to have a look in our data set and see if we can see this in action. This is now using Dask, because this data set of 131 million lines of calls is about 75 gig of data compressed, so it gets really big. Dask gives us a very Pandas-DataFrame-like API but handles it out of core or in a distributed system. We're just going to read in a few columns. In particular we're interested in the symbol column, and within that we're interested in fillText, which
is one way we can see what's happening in canvas fingerprinting. I'm going to go fast, but you can see this is a very Pandas-like API, and it's running out of core to filter that data set for us. Here are the results. I'm just going to skip out of this and go down and zoom in a bit. These are the kind of things that we're seeing in this data set from going to a million websites, the kind of things we're seeing people write to canvases. Canvas is the same technology that Bokeh uses to render visualisations. It's a great web technology and it's super useful, but rendering a smiley face, or this set of text, or the very cheeky "canvas fingerprinting", is not being done to serve a useful purpose or to serve you content online; it's being done to track you. OK, so yes, this is really happening, and we can see it in this data set. I have one minute remaining. I promised you some scikit-learn, and I'm going to skip to the end of this, which is just a very beautiful visualisation that I made. What I started to do is look at that uncategorised data. We can go and look at existing tracking scripts out there and say, OK, they're using this JavaScript API, so we can start blocking that, but we're always going to be one step behind the people that are doing this. So what I want to start doing is ask: can we use clustering to detect, from crawls and from what we have in Mozilla, tracking and new tracking technologies coming to light and happening in real time? This is a project that I'm working on now. It's a work in progress; it doesn't work yet. Each of these is a different cluster, and every little coloured square on here is a different part of the JavaScript API. They're coloured by the different flavours; the outside is like window.navigator. And the thing I want to finish with is that I'm hoping you can do better. I want to let you know that we have open-sourced this data set, and it is available at the
Overscripted Data Analysis Challenge on Mozilla's GitHub. We have a competition running through the summer to win prizes to come and present your work at MozFest at the end of October. Tickets, airfare, all that is included. If you are thinking about getting into data science or doing more analytics, or you're interested in tracking technologies, this is a really untapped data set, and we would love to see what you can come up with to help us fight the good fight at Mozilla against tracking technologies. So thank you very much. So, thanks for this amazing talk. I'm sure there are some questions. Thanks for a great talk. You said you turned on your third-party cookies and then you turned them off again. Does that help? Is there a meaningful difference with them turned on and off? I just started a fresh profile. That would be an interesting thing to do. One thing that's worth noting is that when I first ran that count of how long that data frame was, at the beginning of June, it was actually like 3,700 cookies. I then subsequently opened that Firefox profile maybe a month later, and a bunch of those cookies had expired, and that number went from 3,700 down to the 3,000 that we saw today. So yes, some of those cookies were expiring. Whether it was the good-guy cookies or the bad-guy cookies (I should find a gender-neutral term for that; anybody can be evil), I don't know. But that would be an interesting thing to do: to mess around with your own profile and see it happen. Thanks. Thank you for the great talk. Do you know if the resource will be available later? Say that again?
If the resource, like all your Jupyter notebooks... Yes, they're all available, and you can look at the code that runs. For example, the clustering analysis that I was showing at the end is not runnable code, but the stuff from the start is, and this Overscripted Data Analysis Challenge repository has a working "get started with the data" notebook in it that I put together. If you have any issues with it, feel free to file issues or ping me on Twitter. Hi, thanks for the talk, very interesting. I wonder if you can talk a bit about the features, the attributes, that have been used for the classification of the type of cookies. Sorry, for which type of cookies? What features do you use to classify? You mean like advertising and content, that I did at the start? That was just using the Disconnect.me list. And this brings together what I was doing at the end, the question of whether we can get ahead of the arms race and detect these things in more real time. As far as I know, all of the ad blockers that you could use today, be they proprietary or open source, are for the most part using lists. Some are smarter, some less smart, but they're using lists to say block this, block this, block this, and it means that we're always playing catch-up. What I want to try and do is see if we can detect the good JavaScript, the JavaScript that we want and that's desirable, and the JavaScript that we perhaps consider undesirable. And sure, we can put that choice in the hands of users and then do some of that classification in more real time. But for the moment, the classifications you saw at the beginning of the talk are based on public lists that are out there, from either Adblock or EasyPrivacy or Disconnect.me, lists that you can find online. Thanks for the great talk. One question, more or less related to that: what features did you use for this last plot, where you did the clustering? Sure, I skipped over that. Well, I mean, I feel like
you're just giving me an opportunity to do the rest of my talk. I appreciate that. That would have to be in 30 seconds, please. Why have I got a blank screen of death? Oh no. Well, there we go, I guess the computer knows. I took the symbols, so all of the JavaScript API calls, and I built together a kind of fingerprint for a given script on a given location, and I just did ones and zeroes. So I ended up with every possible JavaScript API call becoming a feature in my machine learning, a column in that array, and then I was trying to do clustering based on those columns and looking for similar patterns. Oh, sorry, I'll just add one thing. The early reason I think that this was promising is that it was starting to pick up things: it was finding a cluster of Facebook and fbcdn. Now, it didn't know that those two domains are connected, that they have a business relationship, but I was starting to see domains that I know, with my human intelligence, have a business relationship being picked out in these clusters. So I do think there's something there, although I have a lot of work to do on it. Okay, so let's give a big hand to Sarah again.
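The ones-and-zeroes feature matrix described in that last answer can be sketched as follows. This is only an illustration under stated assumptions: the talk does not say which clustering algorithm was used, so scikit-learn's `AgglomerativeClustering` is a stand-in; the script names and the handful of API symbols are invented; and the fingerprint here is built per script rather than per script on a given location.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Invented crawl records: which script touched which JavaScript API symbol.
calls = pd.DataFrame(
    {
        "script_url": [
            "a.js", "a.js",
            "b.js", "b.js", "b.js",
            "c.js", "c.js",
            "d.js", "d.js",
        ],
        "symbol": [
            "CanvasRenderingContext2D.fillText",
            "HTMLCanvasElement.toDataURL",
            "CanvasRenderingContext2D.fillText",
            "HTMLCanvasElement.toDataURL",
            "window.navigator.userAgent",
            "window.navigator.userAgent",
            "window.screen.colorDepth",
            "window.navigator.userAgent",
            "window.screen.colorDepth",
        ],
    }
)

# One row per script, one column per symbol, 1 if the script called it.
features = pd.crosstab(calls["script_url"], calls["symbol"]).clip(upper=1)

# Stand-in clustering step: group scripts with similar call patterns.
model = AgglomerativeClustering(n_clusters=2)
labels = model.fit_predict(features.to_numpy())
labels_by_script = dict(zip(features.index, labels))

print(features)
print(labels_by_script)
```

With real crawl data the matrix would have one column for every JavaScript API symbol seen, so a sparse representation and a choice of distance suited to binary features would both be worth considering.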