Hello, everyone. This was a bit nerve-wracking, but fortunately the slides are working now. I don't know about you, but we're definitely very excited to be here, and I want to thank the organizers again for having us. We're from Germany, in case you were wondering about the weird accent, and we're going to present a project we did last year. I'm Andreas and I'm a data scientist. I did the data analysis for this project, and I work as a data scientist for a German cybersecurity organization as well as for 7scientists. And I can now relax a bit because Svea will do the first part of the presentation.
Thank you. I'm Svea and I'm a journalist, mainly doing investigative reporting on tech topics. Before we dive in, I want to explain why exactly we are here. This spring I heard about a severe decision you had here in the U.S.: the U.S. Senate voted to eliminate the broadband privacy rules that would have required ISPs to get consumers' explicit consent before selling or sharing web browsing data. And in the research we did just half a year earlier in Germany, we learned in a very concrete and very shocking way what it means if somebody is selling your browsing data. So I want to take you with me on that journey.
Maybe you can imagine what you would think if somebody showed up at your door and said: hey, I have your complete browsing history, every day, every hour, every second, every click you did on the web for the last month. That somebody was me. Before I explain how we got the data and what else was up with it, I first want to show you the reaction of the first person I confronted with her data. She's a member of the German parliament, a politician, and I was sitting next to her. It's a video, so I'm not sure how this is working.
Oh, I would love to play it; you could hear her speaking in German. So she's not happy. Yeah, it's okay, that's basically what she's saying. Sorry guys, can we get the video rolling here? Can somebody help me with the next one, because I need it?
Okay, so this is what I showed her. This one is not a video, but the next one is, because we want to scroll and dive into her life a little bit more. What you can see here is her browsing history. You can see exactly what she was doing on the 1st of August. You can see that she likes to get up early, and one of the first things she does in the morning is online banking. We can go through her whole browsing history, and that's a video now. Let's try if we can get this rolling. Oh, that's bad, because what you see here is her tax declaration. In Germany it's called Elster, and you can do it online. Typically you do it in August because then the deadline is running out. Can you help us with the video? Because I think we would miss so much. Is that the play button? No, that's the next one, the one to the right. Okay, that's okay.
So you might think: my tax declaration, maybe that's bad. But what's even worse, especially when you are in a higher position, is searching for medication. And that's exactly what she did on a day in August. Sitting in front of her computer, she was searching for a medication that you take when you feel dizzy or when you have tinnitus. So what was it like for her when I was sitting next to her, showing her her browsing history and asking why she had been searching for Tebonin? She said: I don't know why I was searching at the time, but it is a really bad thing to see something like this, especially if it is connected with your own name. And so the story went on.
We basically had this data from three million German citizens, and a lot of politicians, or rather most of their employees, were also in this data. This went up to people close to Chancellor Merkel, where we had the data from these people as well. So how did we do this? Did some shady hacker criminal come and give us this data? No, it's much easier: you can buy it.
What you can see in this landscape here is all the companies that are in contact with this very sensitive, very personal data. It's a business universe of its own, with hundreds of thousands of companies making millions with data. What some of them do is just watch the web and sell the analysis of what they see as a product. Usually this is interesting for competitor analysis or for user analytics. Some of them exchange the data among each other, some keep it all to themselves, and a very few sell it. To whom, you might ask? Here comes the social engineering part.
First of all, you need a very convincing business partner. That was Anna Rosenberg, hi. In this research she was my, or our, alter ego, and Anna worked in Tel Aviv, because I like the city very much, for a very promising startup. After a while she even had more than a hundred business connections, nobody we actually knew, but so many people just accept a contact request without checking. That took us two weeks. And this is the promising company Anna is working for. That took us two hours. It's a website, and the company was named Meets Technology, because technology meets creativity. It was just many nice pictures and some marketing buzzwords.
The story was easy. Our startup had developed a self-learning algorithm, something like AI, and this algorithm could predict which products people are willing to buy in the future. To train this algorithm, we would need a massive amount of data, raw data, and there was a mysterious German customer in the background who would pay for all of this. With this promising story we went on the market. We even had a career page, and in the end we even received job applications. We were on the market for, I would say, a couple of weeks with a valid phone number and a valid email address, and we wrote to and called more than 50, I think nearly 100, companies in the market and asked them if we could have raw data, data out of the clickstream, out of people's lives. In the end we got into a deeper discussion with two handfuls, so 10 or 20 companies, where we really had more phone calls and exchanged more emails. What turned out to be the challenging thing in the research was that I often heard: browsing data is no problem, but for Germany it's hard; we only have it for the US and the UK. So what did we get in the end? Andreas will now dive into the data.
We were actually quite astonished by what we got as a freebie from this company, because when we opened the data we saw that we had received three billion unique URLs for free. The data was structured such that it covered 30 days of browsing. Each entry in the data files basically contained a URL, where some of the URLs were slightly anonymized, like here, where you see these little Xs replacing parts that were removed from the data set. Each line also has a user ID, which is an anonymized identifier, so just a random number, and some other information like a timestamp.
These three billion URLs linked to about nine million different domains and contained the data of roughly three million people from Germany, where some of the users had only a couple of dozen data points and others had tens of thousands of data points.
Now, of course, what we wanted to do was to see if we could get from this anonymized data back to the real people in there, so de-anonymize the users in this data set. For that we used a technique called statistical de-anonymization. It's actually quite an old technique, and there are a lot of academic and applied research studies on it. Here I linked a paper from 2007 which got a lot of press coverage back then, where researchers managed to de-anonymize a data set that Netflix had provided for a machine learning competition. They correlated the data in this data set with other publicly available information, in this case from a website called IMDb, and by doing this they were able to link many of the anonymized users in the Netflix data set to real user names in the IMDb data.
So how does this work? As I said, the data we got consisted of many URLs, each with a user identifier associated. You can imagine that as columns of vectors that contain different attributes and that are associated with a given user. What you put into these attributes is basically arbitrary. In our case, for example, we could put in the information that a given user in our data set has visited a given domain, for example google.com. By doing that we can create this kind of attribute vector for each user, and then we can see if we can also find some publicly available information that contains the same attributes, or some of the attributes, that we have in our anonymized data.
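
The attribute vectors described above can be sketched in a few lines. This is only an illustration under assumptions: the record layout (user ID, URL, timestamp) follows the description in the talk, but the concrete values, the function names `domain_of` and `build_profiles`, and the choice to strip a leading `www.` are invented here and are not taken from the project.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical records in the shape described in the talk:
# (anonymized user ID, URL, timestamp). All values are made up.
RECORDS = [
    (4711, "https://github.com/some/repo", "2016-08-01T06:12:03"),
    (4711, "https://www.gog.com/game/xyz", "2016-08-01T06:15:40"),
    (4712, "https://www.google.com/search?q=stroller", "2016-08-01T07:01:10"),
    (4712, "https://github.com/other/repo", "2016-08-02T21:30:00"),
]


def domain_of(url: str) -> str:
    """Return the host of a URL, lowercased and without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def build_profiles(records):
    """Map each user ID to the set of domains that user visited.

    The set is the sparse form of the attribute vector: a domain in the
    set corresponds to a 1, every other domain to a 0.
    """
    profiles = defaultdict(set)
    for user_id, url, _timestamp in records:
        profiles[user_id].add(domain_of(url))
    return dict(profiles)


profiles = build_profiles(RECORDS)
```

Storing only the visited domains per user keeps the representation small even when the full list of domains runs into the millions.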
So we can, for example, and I will show you some examples later, try to find publicly available data about web pages that people visited and then correlate that with our data set. By doing this we can exclude users on the left side here that are not compatible with the public data that we have seen for a given person. If we have enough information available, we can exclude almost everyone in the list in our anonymized data set, and if we manage to drill down our possible users to one or maybe a few candidates, then we have effectively de-anonymized this person in our data set.
So let's try this. In order to do this we first take our input data and convert it into a matrix where each row corresponds to one user and each column corresponds to a domain that this user has or has not visited. Whenever a given user has visited a domain we put a one in that entry of the matrix, and if not we put a zero there. This gives a very large but sparsely populated matrix that we can then use to compare our data set to publicly available information.
The algorithm that we use for that is quite simple. As I said, we generate this matrix, and then we generate a feature vector, which is called V, from our public information. There we would just put the domains that we find on the public web, for example. We then multiply these two things together, which gives us a new vector that contains, for every user, the number of domains from our public feature vector that he or she visited. We can take that vector and see where the largest value is, and the largest value will often give us the user who corresponds to the public data and who is linked to the anonymized data in our set. Now you might ask yourself how well this actually works. So here I have an example from our data set.
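
The matrix step described above can be sketched as follows. This is a toy version in pure Python on invented data, not the code used in the study: the user IDs, the domain lists, and the function names are assumptions made for illustration. The matrix is kept sparse by storing, for each row, only the column indices that hold a one.

```python
# Toy data: user ID -> set of visited domains (all values are invented).
PROFILES = {
    "u1": {"github.com", "gog.com", "telekom.de", "kicker.de"},
    "u2": {"github.com", "google.com"},
    "u3": {"google.com", "kicker.de"},
}


def build_matrix(profiles):
    """Return (user_ids, domains, rows) for the binary user-by-domain matrix M.

    rows[i] holds only the column indices j where M[i][j] == 1, because
    almost all entries of the real matrix would be zero.
    """
    user_ids = sorted(profiles)
    domains = sorted(set().union(*profiles.values()))
    column = {d: j for j, d in enumerate(domains)}
    rows = [{column[d] for d in profiles[u]} for u in user_ids]
    return user_ids, domains, rows


def multiply(rows, v):
    """Compute M . v: per user, the number of public domains they visited."""
    return [sum(v[j] for j in row) for row in rows]


def best_match(profiles, public_domains):
    user_ids, domains, rows = build_matrix(profiles)
    # Feature vector v: 1 for every domain seen in the public data, else 0.
    v = [1 if d in public_domains else 0 for d in domains]
    scores = multiply(rows, v)
    top = max(range(len(scores)), key=scores.__getitem__)
    return user_ids[top], scores


user, scores = best_match(PROFILES, {"github.com", "gog.com", "telekom.de"})
```

In this toy case `u1` gets the largest value because it has visited all three public domains. With millions of users a sparse matrix library would be the natural choice, but the arithmetic is the same.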
Here we took about 1.1 million of our 3 million users, namely those users that have at least 10 different URLs in the data set, and for a single user we looked at how knowing a domain that he or she visited reduces the number of possible users in our data set that are compatible with having visited these domains. You see that in the beginning we have about 1.1 million users. If you take the first domain, in this case the gaming website gog.com, you can see that there are only around 15,000 users left in the data set that have visited this domain. Now we say, okay, maybe our user is also a customer of Deutsche Telekom, and if that's the case you can see that these two domains combined have only been visited by about 367 users. So from 1 million we have already drilled down the number of possible users to only a few hundred. We can do that again with a few other domains, and as you can see, after only four domains for the given user we have only one compatible user left in our data set, and we would already have de-anonymized this user.
Now the question is, of course, can we actually get the public information that we need to do that? I want to show you two examples. For the first example we use Twitter. As you know, people like to tweet about what they're reading, and they post that on their timeline. So what we did was to pick a random user from our anonymized data set that had a Twitter profile, so a Twitter link, and then we used the Twitter API to scrape all the URLs that this user posted on the timeline in the given time period. From these URLs we extracted the associated domains, and this information we can then simply feed to our algorithm and see if we can find the user again.
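
The narrowing described above, from about 1.1 million candidates to a single one after four domains, is a repeated filter over the candidate set. A minimal sketch on invented data follows; only gog.com and a Deutsche Telekom domain are mentioned in the talk, while `telekom.de` as the concrete domain name, the other domains, the user IDs, and the function name `narrow_down` are assumptions.

```python
# Toy profiles: user ID -> set of visited domains (all values are invented).
PROFILES = {
    "a": {"gog.com", "telekom.de", "heise.de", "chefkoch.de"},
    "b": {"gog.com", "telekom.de", "heise.de"},
    "c": {"gog.com", "telekom.de"},
    "d": {"gog.com"},
    "e": {"telekom.de", "heise.de"},
    "f": {"spiegel.de"},
}


def narrow_down(profiles, known_domains):
    """Filter candidates one known domain at a time.

    Returns the final candidate set and, for each step, the domain that was
    applied together with the number of users still compatible with it.
    """
    candidates = set(profiles)
    steps = []
    for domain in known_domains:
        candidates = {u for u in candidates if domain in profiles[u]}
        steps.append((domain, len(candidates)))
    return candidates, steps


candidates, steps = narrow_down(
    PROFILES, ["gog.com", "telekom.de", "heise.de", "chefkoch.de"]
)
for domain, remaining in steps:
    print(f"{domain:12s} -> {remaining} compatible users")
```

Each additional domain can only shrink the candidate set, which is why a handful of rarely visited domains is enough to isolate one user.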
In this example you can see that we have, I think, about nine different domains, mostly development websites as you can see, like GitHub, some blogs, some documentation pages. What I show you in parentheses is the number of times the URL appeared in our data set. You can see that some of the URLs appear quite often, GitHub for example 2.5 million times, and some others are quite rare in our data set and appeared only 129 times.
Now if we run that through our algorithm, we can plot the results like this. On the x-axis you see the particular user that we are looking at, sorted completely arbitrarily, and on the y-axis you see the number of matching domains from the set on the left that the given user has visited. You can see that of the roughly 100,000 users in our data set that have a Twitter profile, most didn't visit any of those domains, but a few of them visited at least one, two, three, four, five, six or seven. The one user with seven domains is actually the user we were looking for. So this was quite good luck: we were really able to identify this user given only the information here on the left. We did that for a few thousand users in order to verify it. It didn't always yield a 100% success rate, but we were always able to narrow down the number of compatible users to a very small group of maybe a couple of thousand users in total. So you can see that this technique works quite well even with very little information. This also works with different kinds of identifiers.
There's also YouTube data in our data set, because that's usually quite interesting for the advertisers, so those URLs were not anonymized, and we could therefore extract the video IDs from them. Again we played the same game: we looked for users in our data that had a public YouTube playlist, used the YouTube API to download all the elements from this playlist, extracted the video IDs from that, and then again ran them through our algorithm. So on the left side we have the video IDs from a random user, and on the right side you again see the results of our algorithm. Now we have slightly fewer than 20,000 users that have at least one public YouTube playlist in our data set, and you can see that most of them didn't watch any of the clips on the left side at all. But again we have one user who watched about nine of those, and that is again the user we're looking for. So you can see this technique works not only with domains but with a lot of different kinds of data.
These were just two examples of data types that we can use for this kind of analysis. We also looked, for example, at Google Maps URLs. Whenever you open Google Maps in your browser, it stores the latitude and the longitude of the map area that you're looking at in the URL, and of course we can extract that and get an impression of what people are looking at on the map. You can see that the Germans, not surprisingly, look at Germany and of course at their favorite vacation destination, Mallorca. Having that data, you could again search for publicly available geolocated data. For example, here we have public ratings by users on Google Plus, and you could run that data through our algorithm and see if some users in our data set are compatible with, in this case, the geodata of the ratings.
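
Pulling such identifiers out of URLs takes little code. The sketch below assumes the common `watch?v=<id>` form of YouTube URLs and the `@<latitude>,<longitude>` form that Google Maps puts into the path; the function names and the example URLs are made up, and the talk does not show the extraction code that was actually used.

```python
import re
from urllib.parse import parse_qs, urlsplit


def youtube_video_id(url):
    """Return the video ID of a YouTube watch URL, or None for other URLs."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    on_youtube = host == "youtube.com" or host.endswith(".youtube.com")
    if on_youtube and parts.path == "/watch":
        ids = parse_qs(parts.query).get("v")
        return ids[0] if ids else None
    return None


# Map center as it appears in the path, e.g. /maps/@52.52,13.405,12z
MAPS_CENTER = re.compile(r"/maps/.*@(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)")


def maps_center(url):
    """Return (latitude, longitude) from a Google Maps URL, or None."""
    parts = urlsplit(url)
    if "google." not in (parts.hostname or ""):
        return None
    match = MAPS_CENTER.search(parts.path)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


video = youtube_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=10s")
center = maps_center("https://www.google.de/maps/@52.52,13.405,12z")
```

The extracted video IDs or rounded coordinates can then take the place of domains as columns in the same user-by-attribute matrix.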
And this works as well with other types of identifiers. You could do it with Facebook posts and with any kind of social identifier, URL or other information that you have in the data set. To be honest, though, in most cases we didn't even have to do this, because we could already de-anonymize users with a single URL in our data set. That's what we call instant de-anonymization, and here I show two examples of that.
First of all, there's a Twitter URL here from the Twitter analytics page. Maybe you know that page: you can go there and see some statistics about how many people viewed your tweets and how often you got retweeted. The nice thing about this URL, at least for us who try to de-anonymize people, is that it contains the username and that it's only visible to the user who is logged in. So whenever we see this URL in our data set, we can be quite sure that the user to which it belongs is actually the owner of this Twitter profile. This really allowed us to de-anonymize a lot of people without going through the pain of the combination approach, the statistical method.
There are other examples of URLs like that. For example, we have here a URL from Xing, which is the German clone of LinkedIn. Whenever you click on your profile picture, which you see on the left there, Xing attaches this query parameter to the URL, and it is only added if you're really clicking on the picture from your own dashboard. The URL is accessible by anyone, but it's mostly only the people who click on this from their dashboard that actually get to see this form of the URL. So again, by seeing something like that in our data set, we can be quite sure that the person who opened this URL is the owner of this Xing profile. And that's especially nice for de-anonymization because the URL also contains the real name of the person in most cases.
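
Instant de-anonymization amounts to scanning the data for URL patterns that only the account owner would normally open. The sketch below does this for the analytics example. The talk only says that the URL contains the username, so the exact path layout `/user/<username>/...` on `analytics.twitter.com` is an assumption here, as are the record format and the function names.

```python
import re
from urllib.parse import urlsplit

# Assumed path layout of the analytics page: /user/<username>/...
ANALYTICS_PATH = re.compile(r"^/user/([A-Za-z0-9_]{1,15})(?:/|$)")


def username_from_analytics_url(url):
    """Return the username embedded in an analytics URL, or None."""
    parts = urlsplit(url)
    if parts.hostname != "analytics.twitter.com":
        return None
    match = ANALYTICS_PATH.match(parts.path)
    return match.group(1) if match else None


def instant_matches(records):
    """Map anonymized user IDs to usernames leaked by self-identifying URLs."""
    found = {}
    for user_id, url in records:
        name = username_from_analytics_url(url)
        if name:
            found.setdefault(user_id, set()).add(name)
    return found


# Invented example records: (anonymized user ID, URL).
RECORDS = [
    (1001, "https://analytics.twitter.com/user/example_user/tweets"),
    (1001, "https://github.com/"),
    (1002, "https://twitter.com/example_user"),
]
matches = instant_matches(RECORDS)
```

Only user 1001 is matched: the public profile URL opened by user 1002 proves nothing, because anyone can visit it.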
So now that I have shown you how we can de-anonymize the people in the data, we will show you what we actually found in there.
Okay, I have two more examples. One was this guy, and I think it's really a pity that I can't play you the video, so we can't scroll through his whole browsing data, but I can explain what you would have seen. This guy is a police officer. If you go through his whole browsing history, you can see that he's researching topics and that he's going to the police union site and so on, and what you can also see is that he is using Google Translate to translate a text from English to German. This was the text he wanted Google to translate, and it was really funny. The Xs here, I did this; I completely anonymized it because I didn't want to embarrass him in front of you. He was investigating a computer fraud and asking an ISP about a specific IP address, and he also stated his email address, his first name, his last name and his phone number. So that's the problem with Google Translate: everything you put into the translation box goes into the URL, and somebody who has this URL can see the whole text that you have written. You can follow this guy through his whole life in August and see what he was searching for, including some funny stuff.
Okay, one last example. I thought hard about whether to show this at this conference, but I nearly felt forced to do it because it is part of the truth, and it is very typical. I mean, when I spent the first days digging through the data, I thought: now I know the internet.
What is special about this profile is that this guy is a judge at a German court, and he has a really specific taste. If this had been the video, we could have seen it: every day, for a couple of hours, it goes like this. What you can also see, going further through this browsing history, is that he's expecting a child. So he's also looking for baby names, for a stroller, and for where he can go with his wife to have the baby. What I want to emphasize with this last example is that this guy doesn't do anything criminal at all. He's just a normal guy, I would say. But you see how intimate this data can be and how easily he could, for example, be blackmailed with it, especially in his sensitive position. Or what would you think happens when this data gets out and his wife or his employer sees it? I think his life would change dramatically. That is very important to say here.
So who did this? Who collected the data? The answer is easy. When we saw the details and the depth of the data, we had the suspicion that it must be browser plugins, and we did a small test. One of the first things I did when we got the data was to search for colleagues, and I found a really good friend in this data. I told him, and then I asked him if he could please uninstall his browser plugins one by one at specific times. He did this. We had a small time period where we could do something like live looking into the data, which refreshed itself every day, and after the last one he uninstalled he vanished from our screen. It's a browser plugin that is supposed to protect you during browsing, and it has 140 million users worldwide. We then tested this browser plugin a second time in a virtual machine. We made a real scientific setup with an extra website. The guy who did this is Mike Kuketz, a security researcher, thanks by the way. He did the test and we found him in the data, so we could be sure that this browser plugin was spying on him at that time.
What did they say to all of this? After we made the story public, they slightly adapted their privacy policy, but they basically state in their policy that they are collecting web pages visited and timestamps of the visits, and that they go to great lengths to make sure that the information remains anonymous. After everything we have told you, you now know that this is nearly impossible.
So this gave us one of the plugins, or browser extensions, that was collecting data, but the question was of course how many others there are. Luckily, in the data set we could also get a unique identifier for each extension version that collected a given data point, and by analyzing that we can see how many different extensions are actually generating data. This is what you're seeing on this graph, which is a double logarithmic graph, so with every unit here the number multiplies by 10. On the x-axis you can see the different extension versions, sorted by the number of data points that they contributed to the data set, and this number itself you're seeing on the y-axis. You can see that the most popular extension already contributes about one billion data points to this data set, and if you take the first ten extension versions, 95 percent of the data is already accounted for by those. When I say ten extension versions, I don't mean ten different browser extensions, but possibly different versions of the same extension or different variants, for example versions for different browsers such as Firefox and Chrome, or a new release of the extension. You can see from the number of IDs we have in the data set that it could be up to ten thousand extension versions that are spying on the users. Effectively the number is probably a little lower, because the version IDs probably correspond to different versions of the same extension, so the true number should be somewhere between a few dozen and maybe a few hundred or a few thousand extensions, I think.
So the question is of course why it is interesting to use browser extensions for tracking at all. As you might know, tracking normally works this way: you visit a website where a tracker is installed, this website asks your browser to download some JavaScript, and this JavaScript then collects something and sends the data to a tracking server. Of course this is a problem that has existed for many years, and that's why many users today are using plugins to block these trackers, for example uBlock Origin here. That's quite an effective way to keep most of these trackers from sending data to the remote server. For the data collection companies this is of course a problem, because the collection efficiency goes down and they can see less and less of the users.
Now if you imagine that, instead of having a script installed on a website, you use a browser extension, this is much more attractive for the data collector, because the extension can completely bypass the security mechanism and send the data directly to the tracking server without being blocked by other extensions in the browser. This is of course very interesting for them. The even bigger advantage of these extensions is that they do not only track the user on the pages where trackers are actually installed, but on every page that he or she visits, so even on pages that have no trackers at all. And they get the complete URL information of the user, which for their purposes is of course more interesting since it gives a more complete picture of the user.
Now the question we often get is: can I protect myself against this? The first thing that you need to do is to be very careful with the browser extensions that you're installing. There are various studies and projects that try to analyze these extensions by putting them in a sandbox, so if you're installing a new extension you should always check whether you can find it in those lists and whether it comes from a trustworthy vendor. You always have to ask yourself how the creator of this extension is making money with it. Is it just an open source project, or is it something commercial? And if it's something commercial, how are they generating their revenue? Are they maybe using my data for that? But even if you do that, and even if you use plugins like Privacy Badger or uBlock Origin, there's still a possibility that you can be tracked based on your IP address or another identifier that stays stable over a longer time period, like a few hours or sometimes even a few days. In order to alleviate this problem, the only thing that you can do is to use a proxy solution like Tor, for example, or a commercial VPN that uses rotating IP addresses or exit nodes in order to mask your IP address. If you can do that, that's a good way to protect yourself. You should of course also be careful about which extensions you're using, because some of the commercial ones are collecting the data themselves. You have seen before that Web of Trust is actually doing that to their users, or was doing it, and there are other extensions which are similarly collecting data from their users in order to sell it. So always try to find a trustworthy vendor.
Another question we often get is: can I hide in my data by acting randomly? Can I just open different websites and try to fool the algorithm into thinking that I'm somebody else? Unfortunately, in most cases the answer is no, because the algorithm, at least the one that we are using in our study, is not sensitive to this kind of perturbation. You can think about it in this way. You saw earlier that what we are doing is basically comparing the public information that we have about a given user to the domains or URLs that this user opened. If a single user keeps adding new domains or new URLs to his or her profile, this won't actually change the score, because the public information stays the same. And since other users are not changing their data, it also doesn't change the score relative to other users. So effectively, by just acting randomly you can normally not hide in your data, unfortunately. So what are the takeaways?
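
The point about acting randomly can be checked on a toy example. This sketch uses the plain overlap count as the score, as in the matrix method described earlier in the talk; the profiles, the public domain list, and the decoy domains are all invented.

```python
def score(profile, public_domains):
    """Number of publicly known domains that also occur in a profile."""
    return len(profile & public_domains)


def ranking(profiles, public_domains):
    """User IDs ordered from best to worst match."""
    return sorted(
        profiles, key=lambda u: score(profiles[u], public_domains), reverse=True
    )


# Invented toy data.
PUBLIC = {"github.com", "gog.com", "telekom.de"}
PROFILES = {
    "target": {"github.com", "gog.com", "telekom.de", "kicker.de"},
    "other_1": {"github.com", "google.com"},
    "other_2": {"google.com", "kicker.de"},
}

before = {u: score(p, PUBLIC) for u, p in PROFILES.items()}

# The target tries to hide by visiting 1000 additional random sites.
noise = {f"decoy-{i}.example" for i in range(1000)}
noisy_profiles = dict(PROFILES, target=PROFILES["target"] | noise)

after = {u: score(p, PUBLIC) for u, p in noisy_profiles.items()}
```

The scores before and after are identical, and the target still ranks first. Extra visits can never lower an overlap count; a score normalized by profile size could behave differently, but that is not the score described in the talk.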
I hope you could see that it's highly problematic to publish this kind of very high-dimensional, user-related data, because it's very, very difficult to anonymize it even if you have the intention of doing so. You can also see that the public information that is available about users, be it on social media or other channels, is growing, so it's becoming easier to find the kind of public information that you can then use to do the de-anonymization. And lastly, you have seen that only a few data points, in most cases fewer than 10, are already sufficient to lead back to a single user, even in a very large data set of millions or tens of millions of users.
So we want to thank you all for listening to our talk, and we also want to say special thanks to a very big team of people who have been working with us and believing in us. Thank you.