Hello, everyone. So today I'm going to talk about browser fingerprinting, and in particular about an experiment that we did at EFF, called Panopticlick, to measure how fingerprintable web browsers are. So before we get to browsers, let's talk about identifying information. Now, when we ask what kind of information identifies a person, we have some standard sorts of answers. Like, oh, if I know their name and address, I probably know who they are. But there are some more surprising examples of identifying information. There's a paper by Latanya Sweeney from the 1990s that showed that if you know someone's ZIP code, their date of birth, and their gender, then you have about an 80% probability of being able to identify them uniquely. Now that's a bit surprising and counterintuitive, so let's see how it happens. If you start with 7 billion people on the face of the planet Earth, knowing a ZIP code rapidly narrows you down to a group of about 20,000 or maybe 50,000 people, maybe in some cases 100,000, but not very many. Then you can divide that number by 365 because you know what day of the year they were born on. You can divide by a more complicated number that's about 70 because you know how old the person is. And then you can divide in half again because you know whether they're male or female. And it turns out that if the ZIP code you started with had fewer than about 50,000 people in it, you now probably have a unique person at the end of this process. So there's a mathematical measure you can use to say how identifying a set of facts about a person is, or how much information is required to identify someone. If you need more bits to identify them, then each bit doubles the number of possibilities. And if you're learning more facts about them, then each bit you learn halves the number of possibilities. So you can think of these as trading off against one another. 
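That narrowing arithmetic can be sketched in a few lines of Python. The ZIP code population below is an illustrative figure I've picked, not a number from the talk:

```python
import math

population = 7_000_000_000
bits_needed = math.log2(population)        # ~33 bits single out one person on Earth

zip_population = 30_000                    # hypothetical ZIP code population
remaining = zip_population / 365 / 70 / 2  # divide by birthday, age (~70), gender
print(f"bits to single out one person: {bits_needed:.1f}")
print(f"people left after ZIP + birth date + gender: {remaining:.2f}")
```

Since fewer than one person remains on average, the combination is probably uniquely identifying, which is Sweeney's result.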
So for instance, on the face of the planet Earth, with 7 billion of us people, you need about 33 bits to learn the identity of one of us. And if you learn someone's birth date, what day of the year they were born on, you learn about 8.51 bits. So if we talk about a random variable you might measure, like someone's birth date, we can talk about the amount of information you learn when you learn a particular value of that random variable. So if we're talking about birth dates and you learn that my birth date is the first of March, then you've learned 8.51 bits about my identity. If, however, you learn that my birth date is the 29th of February in a leap year, then you get a bit more information, because the likelihood of that being true is only a quarter of what it would be for any other birthday. So you get more information: 10.51 bits, specifically. We call that first measurement the surprisal, or self-information, of a fact you've learned. So the surprisal of my being born on the 29th of February is 10.51 bits. And then we can talk about the entropy of this type of measurement, which is the expectation value of the surprisal. So if you have a probability distribution, you take the expectation across all of it. There are some formulas you can use to look at this; we have a paper on our website that has all the mathematics in a bit more detail if you're interested. A point to note is that you can't always just add surprisals together. If you learn someone's birth date and then you learn what city they were born in, those two things are probably independent variables, not entirely, but close to it, so you could add the numbers of bits together. But if you already know someone's birth date and then someone tells you their star sign, you're actually not going to learn any more information. So if you want to know how two of these measurements add together, you need to do some fancy stuff with conditional probabilities. So what use is all of this theory? 
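The surprisal and entropy figures above are easy to reproduce; here's a minimal sketch, treating the 4 × 365 + 1 = 1,461 days of a four-year cycle as equally likely:

```python
import math

def surprisal(p):
    """Self-information, in bits, of observing an event with probability p."""
    return -math.log2(p)

def entropy(dist):
    """Expected surprisal over a whole probability distribution."""
    return sum(p * surprisal(p) for p in dist if p > 0)

# Over a four-year cycle there are 4*365 + 1 = 1461 days.
p_ordinary = 4 / 1461   # any particular non-leap-day birthday
p_feb29 = 1 / 1461      # the 29th of February

print(f"{surprisal(p_ordinary):.2f} bits")  # ~8.51
print(f"{surprisal(p_feb29):.2f} bits")     # ~10.51
```

The Feb 29 event is four times less likely, so it carries log2(4) = 2 extra bits.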
Well, we can apply it to tracking web browsers. Now, what do I mean by tracking web browsers? Two things. One is, if you go to a website on day A and then you come back three weeks later and do something else, can the website link those two acts of yours together? And also, if on a given day you go to two different websites, and you look up one fact on one and then another fact on the other, could those two sites get together, combine the facts, and get a deeper picture of you? Either of those would be tracking. Now, there are at least three widely discussed ways to track web browsers. You can use cookies, which are these little bits of information your computer will store and send back, and sites can put serial numbers and tracking numbers in them. But of course, we're at DEF CON, everyone here knows about cookies, and you all have browser settings to delete them or limit them. So we can survive cookie tracking. And IP addresses; everyone knows about IP addresses, but IP addresses can be hidden by a proxy server, or you can move from your house to your friend's house to the cafe to work, and each time your IP address changes. Lastly, there are these quite nasty things called supercookies, the most infamous example being Flash cookies. They are essentially exactly like cookies, except that when you tell your browser not to set cookies, Flash still has some other settings page somewhere that you haven't changed, and so Flash is still storing its own kind of cookies. You can think of these things as hoops that you have to jump through in order to get privacy on the web. If you want to not be tracked, you have to avoid tracking by cookies, avoid tracking by IP addresses, avoid tracking by supercookies, and then, if you do all that, it's time to talk about whether you can be tracked by a fingerprint. 
So fingerprints are like Latanya Sweeney's example of tracking someone with a few seemingly innocuous facts, like their ZIP code and their date of birth. It turns out that the characteristics that web browsers have, like the version of the browser, what operating system it's on, et cetera, combine together in the same way as those other facts, and perhaps they make your browser unique. Now, there are different degrees of uniqueness that you might get out of the version information in a browser. You might have complete global uniqueness. So you're sitting there in the third row, your browser is completely unique in the whole wide world, and we know it. So whenever we see it, we can track your browser. But perhaps that's not true. Even so, browser fingerprints may be a problem, because they mean that your IP address combined with your fingerprint is uniquely identifying. So if you guys all have your laptops open and are surfing during my talk, please don't do that. But if you do, then perhaps there are 2,000 people behind a proxy server here at DEF CON surfing at once, and that gives you some degree of anonymity, except that if you all have fingerprints that are not unique in the world but are unique at DEF CON at least, then that would be a tracking mechanism. Another possibility is that fingerprints don't track you all the time; it's just that whenever you go and delete your cookies, your fingerprint is unique enough to link your old cookie to your new one and give you a new copy of your old cookie back again, a zombie cookie if you like. And the last thing that makes fingerprints really nasty is that unlike cookies, they're automatically the same when different sites collect them. So if a website over here sees your fingerprint and a website over there sees your fingerprint, they see the same thing. If they both tracked you with cookies, they'd get different cookies, and they'd have to have some sneaky process to link them together, whereas fingerprints are automatically the same. 
So we heard a lot of rumors at EFF about fingerprints. We heard that some web tracking and analytics companies had started using them. We'd heard rumors that web-based DRM systems were using fingerprints to track people. We'd heard that they were being used as a backup authentication mechanism for financial systems, which is maybe less of a problem than the first two. And we got really curious about how effective this method was. We were also worried about a more mundane question, which is that every single time you go to a website, almost all websites are configured to log your browser's user agent string, and we were wondering how much of a problem that was. So we decided to get some numbers to find out. So we put up this little website, panopticlick.eff.org. And if you went to that website, if you still go there, you see a page like this that tells you what's going on and then gives you a little button you can click if you wanna be part of the experiment. And if you click on that, you get a page that looks like this that says, oh, your browser appears to be unique, with at least 20 bits, possibly more, of identifying information coming from your fingerprint. And then there's this little table here showing what all the component measurements were and how identifying each of those was on its own. So you can see there's a user agent string, there are accept headers, plugin details, et cetera. Really telling stuff in there, like the fonts. I'll talk about where these came from more in a second. So these are the eight measurements we had. These top three are just things that you always send to a web server when you ask for a page. The next four come from JavaScript. So there's a little bit of JavaScript that runs in the page, and if your browser supports JavaScript and doesn't have it disabled, the JavaScript will collect this information and send it back to us with an HTTP POST command. 
And lastly, if you have Flash or Java installed, we'll go into Flash or Java or both and ask those plugins for your list of system fonts. So yeah, so we have these different measurements. They're all kind of problematic, but it'll turn out that the two most problematic ones are the plugins here and the fonts at the bottom. Now, there are a lot of things we didn't collect but which you could use to make these fingerprints even nastier. And in fact, we've since seen that there are companies in the private sector that will sell you a fingerprinting system that doesn't just do the kind of eight things we did, but actually also does a lot of this other stuff. One particularly nasty thing is that you can measure the clock skew of a quartz crystal, that is, how much faster or slower than another clock your computer is, and that's very hard to hide, and it's unique to your hardware rather than just your software. You can measure the characteristics of the operating system's TCP/IP implementation. You can measure the order in which the headers show up. There's lots of stuff in the ActiveX, Silverlight, and other plugin libraries that we didn't have time to dig through and mine, but you could do that. There are quirks in the way that each browser has implemented JavaScript that could be identified with the right code. There's a really nasty bug, which has just recently started being fixed in browsers, where you can measure the history of a browser using CSS detection. Some of these things we didn't collect because we just didn't have time to implement them. Some we didn't collect because we didn't know about them; once we put the site up, we got a lot of emails saying, hey, you could collect this thing as well. And lastly, some things, like this CSS history, we weren't sure would be stable enough to include in a fingerprint without some kind of fuzzy matching code that we didn't have. 
So all of our results, I guess the point here is that all of our results should be taken as a kind of optimistic story about how much privacy you have. With a really good fingerprinting implementation, the fingerprint is more powerful and more revealing. As for the way we handled the data from our site: we set a three-month persistent cookie, and we stored an encrypted IP address with a key that we threw away. We used those primarily to avoid counting you twice. If you came back, we wanted to know whether you were the same person with the same fingerprint or another person with the same fingerprint, because those are two very different things. And we had an exception to that, which is that if at a particular IP address we saw cookie A and then cookie B and then cookie A again, we thought that's probably, or at least potentially, evidence that there is more than one computer behind that IP address, perhaps behind a NAT firewall, with the same fingerprint. And that could be important, because it could be a sign that there's a corporate network there with some sysadmin who clones the same code out onto all the machines every day, and so there are genuinely multiple machines with identical fingerprints there. And that gives the people in that office some protection. So we decided not to treat those as repeat visits from the same browser if they were interleaved. One thing to note is that a lot of people were confused by the numbers we had on our site. Those use the cookies, but not the other methods of avoiding double counting, because we had to compute that stuff on the fly for millions of visitors. The data set I'm presenting in this talk has more careful controls for that stuff. So we got a pretty big data set. We had two million hits, about a million distinct browser instances in there. The people that we were measuring are not representative of the entire web user base. They're people who read Slashdot or Boing Boing or come to the EFF site. 
And so they're more of a data set of privacy-conscious people, people who know about and care about privacy. However, as I said before, you have to jump through three hoops to not be trackable already on the web, and so we think this is kind of the relevant data set for asking whether people who block cookies and know about IP addresses and how to hide them and so forth end up being trackable by their fingerprints instead. Another point is that the data in this talk is all based on the first 470,000-odd instances, so the first half of the data set. It turned out people were really unique: 84% of the browsers that came to our test site were completely unique in the data set. If you split the data set up and just look at the browsers that have either Flash or Java installed, and that's the relevant set if you're talking about desktop browsers, your uniqueness rate goes up to 94%, and only 1% have a fingerprint that we saw more than twice. So this is the same thing on a graph. Note that this graph has log axes. If you drew this graph of how common the different fingerprints were without log axes, you'd end up with a graph where the line runs exactly along this axis all the way and then exactly along this one, so in order to see any structure you have to put it on log scales. But the important thing to note is that 84% of the data is right down here in this section on the tail. It's unique. And then there's another group of about 20,000 people who have an anonymity set size of two, meaning there were two browsers that had that fingerprint. There's a small group with three, four, five, and then right up the other end you have a small number of browser fingerprints that were not unique at all. The one right at the top is a Firefox instance that's not running JavaScript. So with a recent version of Firefox with no JavaScript, you have a decent amount of anonymity. 
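The uniqueness counting behind those percentages can be sketched in a few lines; the fingerprints below are made-up toy tuples, not real data:

```python
from collections import Counter

# Toy fingerprints (hypothetical values): each is a tuple of measured components.
fps = [
    ("UA1", "plugA", "fontsA"),
    ("UA2", "plugB", "fontsB"),
    ("UA1", "plugA", "fontsA"),  # shares its fingerprint with the first browser
    ("UA3", "plugC", "fontsC"),
]

counts = Counter(fps)  # maps each fingerprint to its anonymity set size
unique_browsers = sum(1 for fp in fps if counts[fp] == 1)
print(f"{unique_browsers / len(fps):.0%} of browsers had a unique fingerprint")
```

Doing the same count over all eight components of each real visit is what produced the 84% and 94% figures.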
There's an interesting statistical question that you might ask, which is: okay, sure, you saw 94% or 84% uniqueness in your data set, but that was only 500,000 people. Would people be less unique if we could get data for the whole one to two billion people who use the web? And this is an interesting statistical question. I have a theory about how to solve it involving Monte Carlo simulations: you try a hypothesized probability distribution, run it through a simulation, and see if it produces a graph that looks like that. But we didn't try to do this, because in a sense our data set, which is just a measurement of privacy-conscious users, is not meaningfully representative of the one to two billion browsers in existence. So if someone else has a less biased data set, they could tackle this statistical question. We didn't try. Now, any graph that tries to show you everything that's going on in this data set is gonna be really complicated, because it's half a million data points, but I'm gonna try with a couple of them. This one shows, for each category of browsers (Firefox, IE, Opera, Chrome, Android, iPhone, Konqueror, BlackBerry, Safari, and then a lumped-together collection of Links and other text-mode browsers), how good or bad it was from a uniqueness and trackability point of view. And so if you look at this graph, anything that's over here is a proportion of uniqueness; these things were completely unique in our data set. At the other end we have the least revealing fingerprints. So let's take an example. Firefox is this black line. It follows this curve down here, where it has a little bit of a tail in the non-unique area. That's because some people had JavaScript turned off in Firefox, or they were running Torbutton and it shows up as Firefox. And then right up here, there's a very large number of unique Firefoxes. All of the desktop browsers aside from Firefox are like that, but without this little tail of non-unique people. 
So generally, desktop browsers are bad. The browsers that did well: the iPhone does very well; it's not very fingerprintable. It's this purple line, and there are quite a lot of iPhones that are not unique. That's perhaps not surprising, because there aren't yet plugins and font variation on iPhones. Really, all you're talking about is what time zone you're in, what language you're in, maybe which version of the iPhone OS you have, but there's not very much to fingerprint an iPhone with. Android does almost as well, not quite as well because there are more iPhones than Androids, but those phone browsers look pretty good. In practice, of course, they have really bad cookie settings, so people who use them probably get tracked by their cookies, but this was a good result for the phones. I'll let you guys ask more questions about these crazy graphs at the end. Now, if we look at the variables that we measured, we had these eight measurements, and we can ask which ones were the problematic ones. This table shows, to first order, which ones they were. So the user agent is pretty bad. It's 10 bits of information. Every time your web server logs a user agent, you expect on average that you're narrowing the population down to a thousandth of what it could have been, if someone wants to browse anonymously. The things that were worse: plugins, 15.4 bits; fonts, about 14 bits. So these things that your browser publishes are very revealing. If you wanna ask, okay, what does the distribution of different values for all of these things look like, you end up with this crazy graph here. It's in our paper as well. I can try to explain it. It says how many people, for each of these measurements separately (there are eight different measurements), fell into an anonymity set of size K. So an anonymity set size of one means you are completely unique because of your fingerprint or your fonts or your plugins. 
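Per-measurement entropy figures like the 10 bits for the user agent or 15.4 bits for plugins are estimated from the observed frequency of each value; here's a minimal estimator sketch over synthetic data:

```python
import math
from collections import Counter

def observed_entropy(values):
    """Estimate the entropy of one measurement, in bits, from observed frequencies."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A measurement almost everyone shares carries very little information...
print(observed_entropy(["cookies on"] * 99 + ["cookies off"]))  # ~0.08 bits
# ...while one that is different for every browser carries the maximum possible.
print(observed_entropy([f"plugin-list-{i}" for i in range(100)]))  # ~6.64 bits
```

With 100 samples the estimate tops out at log2(100) ≈ 6.64 bits; seeing values like 15.4 bits requires the much larger real data set.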
So you see up here, there are a lot of people who were unique, 200,000 or 250,000 people who were unique just because of the plugins they had installed in their browser. There were 200,000 who were unique just because of their fonts, 20,000 or 25,000 who were unique just because of their user agent, et cetera. As you go down here, these are less identifying values. And then along this line, right up at the end, having cookies enabled wasn't a very revealing fact. So if anyone has questions about what this data looks like, you can come back and study this graph and pore over it for half an hour, or ask me questions about it. So another really interesting question you might have is: sure, you can identify people, but don't these fingerprints change over time? Are they really a stable way to track someone if they could upgrade their browser or install a new font and suddenly the fingerprint would be different? And so we decided to check this. This graph shows the set of people who visited Panopticlick exactly twice. We wanted to throw away people who might have been playing with the site, trying to optimize their uniqueness or tweak things. We just wanted people who came exactly twice, with at least an hour or two in between the two visits they made, so they didn't just hit reload, they came back later. And then we asked, as a function of how much later they came back, what was the probability that their fingerprint had changed? And you can see that as more time passes, the likelihood that the fingerprint was different when they came back goes up. We measured this, by the way, with cookies. So there was a cookie that you could reliably use to recognize the same person, and then you can see if the fingerprint changes. So actually, fingerprints don't last very long. The half-life of these things is four or five days. So perhaps that's actually a really good sign. 
Perhaps fingerprints, while they're instantaneously identifying, aren't a stable way to track people over time. Unfortunately, this turned out not to be true. The way we checked that was to say: okay, your fingerprint has changed. Can we do some kind of fuzzy matching algorithm that will tell us whether your fingerprint after the change is uniquely tieable to your fingerprint beforehand? And I implemented a really hacky algorithm to do this. It just says: if only one of those eight measurements has changed, and it hasn't changed very much, and it maps to a unique fingerprint from beforehand, then let's guess that it's you. And it only tries to do this if you had something quite revealing like Flash or Java installed. So this algorithm makes a guess about two-thirds of the time, but when it does guess, it's 99% accurate. So it has a 99% chance of correctly guessing which fingerprint you changed from and less than a 1% chance of getting it wrong. So even though fingerprints change quite fast, even though the half-life of a fingerprint is five days, you're actually still trackable once your fingerprint has changed. So there were really only four categories of browsers that survived this, and I've mentioned them all quickly in passing. If you block JavaScript, perhaps with NoScript, the Firefox extension to do that, you're in pretty good shape. If you use Torbutton: Torbutton zaps the plugin list, and the Torbutton developers knew about a lot of these attacks and anticipated them in various ways. You don't have to use Tor; you can just use the little Torbutton Firefox extension, and you're in pretty good shape. If you use an iPhone or an Android and you manage the cookie problem, you're in pretty good shape. And lastly, there was this small percentage of systems behind firewalls that appeared to have the same fingerprint. We saw about 3% of IP addresses that had multiple visitors coming from them exhibiting that kind of behavior. 
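That hacky matching algorithm could look something like the sketch below. This is my reconstruction from the description in the talk, not the original code, and for brevity it leaves out the Flash-or-Java precondition and the "hasn't changed very much" check:

```python
def guess_previous(new_fp, old_fps):
    """Guess which previously seen fingerprint a changed browser came from:
    if exactly one component differs from exactly one old fingerprint,
    guess that it's the same browser; otherwise refuse to guess."""
    candidates = [
        old for old in old_fps
        if sum(1 for k in new_fp if new_fp[k] != old.get(k)) == 1
    ]
    # Only guess when the match is unambiguous.
    return candidates[0] if len(candidates) == 1 else None

# Hypothetical example data:
old = [
    {"user_agent": "Firefox/3.6", "plugins": "Flash 10.0", "fonts": "A,B,C"},
    {"user_agent": "MSIE 8.0",    "plugins": "Flash 10.0", "fonts": "D,E"},
]
# Same browser after a Flash upgrade: only the plugins component changed.
new = {"user_agent": "Firefox/3.6", "plugins": "Flash 10.1", "fonts": "A,B,C"}
print(guess_previous(new, old))  # matches the first old fingerprint
```

Refusing to guess when more than one old fingerprint is a single-component match is what keeps the accuracy high at the cost of only answering some of the time.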
So that 3% of systems maybe has some kind of anonymity, although it's a bit hard to distinguish that from a browser's private browsing mode. And it would also be the case that if you implemented the clock-skew, hardware-based fingerprinting, you could probably tell people apart even if they have a firewall and a cloned fingerprint. So currently, there aren't very many web browsers that do well. We also saw some other really interesting things. One interesting thing was that sometimes privacy-enhancing technologies are the opposite: something that's designed to hide your identity turns out to be the unique thing that tracks you. If you install a Flash blocker, for instance, that has a distinctive signature: you can tell, okay, this browser has Flash installed, but we're not getting an answer back from the Flash plugin when we ask it for fonts. So people who'd done that were all pretty much unique. The same was true of people who'd forged their user agents. Actually, something I forgot to say about this graph, let's go back here. Let's look at this graph here. Remember how I said this is the iPhone, this pink line? And you go over here and these iPhones aren't very unique. And then there's a little group of iPhones up here on the right that is unique. What's going on there? Why are some iPhones unique? Well, it turns out, if you look at this, that a large portion of those iPhones aren't iPhones. This graph just believes the user agent string, but if you go and look, some of these have Flash Player installed, and other stuff gives them away as actually not being iPhones. They're desktop browsers masquerading as iPhones. And we scratched our heads when we saw this and thought, why are so many people pretending to be iPhones? And then we realized that AT&T had had a promotion for a long time where they gave you free Wi-Fi for your iPhone, and the only way they checked that was the user agent string. So I think they've fixed that now. 
But the lesson here is: okay, that's funny, but also, if you wanna fake your user agent string, you have to fake more than your user agent string. You have to fake all the other stuff that distinguishes an iPhone from a desktop machine, and if you don't, then you're in danger of creating yourself a unique fingerprint. Similarly, there was a very distinctive bunch of Firefox machines that supported Internet Explorer's ActiveX userData cookies. Okay, those are probably not Firefox. The noteworthy exceptions to this problematic rule about privacy-enhancing stuff defeating your privacy were NoScript and Torbutton, which are both fingerprintable, but the amount you gain from having them turned on outweighs the amount you lose from them. Another lesson here is that there's a trade-off... well, sorry, actually, wrong slide. Another lesson here is that if you're designing an API that's gonna run inside a browser, you should never, ever offer some call that returns a gigantic list of system information about the machine it's running on. This was true both of the plugin list, where you just ask navigator.plugins, hey, give me the plugins on the system, and you get back a list of all of them and the version numbers of all of them, because that's gonna make a lot of people unique. Similarly, don't return a list of all the fonts. If you really need to show people a particular font, make them ask about the specific font rather than being able to ask about all fonts at once. Perhaps an even better solution would be to not have your system fonts displayed in your browser at all. Perhaps if a website wants to render this crazy Frankenstein font I've got here, it should have to give you the TTF file along with the website. The problem here is that even if we block the bits of Java and Flash that give font lists back, there are some nasty websites out there that show you can detect fonts using CSS, which is almost unblockable. 
You just render the font inside an invisible box and then measure how wide it is, and so if the user has the font installed, the box has width A, and if they don't have it installed, it has width B. There's a cute little website called Flipping Typical that demonstrates this. So this stuff is hard to block. Another lesson is that fingerprintability trades off against debuggability. So if you look at a user agent string, just a hypothetical one here, this is the typical stuff it has in there: my operating system, I'm running X, a particular hardware platform, my language, the precise date that my Gecko was built on, all this stuff. Why is that in there? Like, why does every website I go to need to know what date my browser was compiled on? What are we doing here? The answer is that some people thought that maybe one day we'd wanna debug something, and when we did, wouldn't it be awesome if we'd already logged on the server side all of the stuff we could possibly want for debugging a client-side issue? And okay, fair enough, maybe occasionally there's some glitch somewhere where having this version information is useful. But there's a trade-off between privacy and debuggability going on here, and right now, browsers are all configured right up at the extreme-debuggability, extreme-non-privacy end of this spectrum. At least when you enter a private browsing mode in your browser, perhaps it should be making that trade-off the other way around. The same is true of plugin lists. Right now, if you look in the plugin list and ask, well, what version of Flash do you have, it's not just Flash, or Flash 10, it's Flash 10.1 r53. And so all these little facts, when you have 10 plugins and you get this much version information about all of them, add up. So, the general lesson. Right back at the start I said there were four kinds of attacks that you could use fingerprinting for. One is global uniqueness, and in a lot of cases it looks like browsers are globally unique. 
Not all cases, but a lot of them. In the second case, where you have an IP address plus a fingerprint, you're almost guaranteed to be able to track someone. And you can definitely undelete cookies that people have deleted, and you can definitely link these things across websites. So this is a serious privacy problem. Right now the only things you can do are things that power users are gonna do. You can use NoScript, you can use Torbutton, but you can't tell your grandmother, or anyone else who isn't really comfortable with complicated user interfaces, to use these plugins. So everyone else is gonna need to wait for the browsers to find a solution to this. Fortunately, they've sort of started. We've been talking to the Mozilla people, and they're interested in attempting to fix some of this stuff, at least in private browsing mode. Google is maybe a step behind, but is also interested in trying to tackle this. So perhaps we can come back in a year or two and say we've made a small amount of progress on the problem. Anyway, that's all I have to say. As always, if you like the work we do, either this kind of technical stuff or our legal work, support EFF; there's a membership booth over in the vendor area. And I'll take questions if anyone has any.