So my name is Brian Kennish. I'm going to be talking about how our web browsing history is leaking into the cloud. I never actually talk about myself much at these things, but I think given the topic, I have to do a little bit here, and it's really going to be more of a confession than an autobiography.
So about ten years ago, I showed up at DoubleClick, and my job was to figure out mobile advertising. At the time, you know, no one knew anything about mobile advertising, but I especially had absolutely no clue. So I used DoubleClick's money to get hold of every mobile device in the world that I could. I got a big pile of ugly phones that looked something like this, except I had way more of them, and there were a couple of cool-looking Japanese phones thrown in the mix too. And I plugged these things into a proxy server to see what data they were sending and what data we could target ads against. I still really clearly remember being kind of shocked when I saw that these things were transmitting location. And I thought to myself, why the hell would anyone want DoubleClick to know almost exactly where they are just to see an advertisement? But there it was, and of course I figured advertisers would be into this stuff, so I put it in our mobile advertising server.
So it turned out we were like seven years too early on mobile advertising, and, not being satisfied just working at the biggest data collection company in the advertising space, I went to the biggest data collector in the world, period: Google. I was an engineer at Google for a long time and worked on a lot of stuff, but mostly ad stuff as well, so AdWords and AdSense. Later on I worked on Wave, and the last thing I was working on was Chrome.
So about 10 months ago, while I was happily working on the Chrome team, I read this article in the Wall Street Journal, which was about Facebook leaking personally identifiable information to third-party app developers.
And it sort of got me thinking about the huge amount of data that Facebook was collecting about us, specifically all the data that they were collecting in an invisible way when we weren't on Facebook.com. So I went home that night and whipped up this quick Chrome extension called Facebook Disconnect. I spent about four hours doing this thing. It was really like a throwaway thing. People seem to be impressed when I tell them it took me four hours, but to be honest I spent two and a half of those hours making the logo. And the entire code base of this thing is like 20 lines of code. It's pretty embarrassing.
So I had done a couple of personal browser extensions up to that point. I think one of them had something like 36 users. I just released this thing. I figured there might be a worldwide audience of like 50 people, the size of a football team. But within two weeks there was an entire stadium full of people using this thing. More than 50,000 people had installed it and were running it. And that got me thinking, hey, maybe people actually care about this privacy stuff. I know I do.
So I wanted to do a follow-up extension that did more than just stop your data from going to Facebook. But the problem was, again, I was working at the biggest data collector in the world. So I asked a lawyer what would happen if I did a broader extension that included, you know, depersonalizing stuff on Google. And he said I probably would get sued. I didn't like that idea. So I quit Google and I spent three weeks making this follow-up extension, which I just called plain Disconnect. That stopped your browsing history from going to all the major social networks, and it also depersonalized your searches. So if you did a search on Google or Yahoo, it wouldn't be tied to your name anymore. I'll talk about all this stuff in more detail in a second.
So anyway, this stuff got a little bit of press, and in particular a reporter from the Wall Street Journal asked me if I thought there were any big privacy stories that hadn't been told yet. And I said yes, social widgets. So I explained to him what was going on in the social widget space, and it went a little something like this. I'm going to do this very quickly, because it's a little bit of web 101, so as not to bore you.
Let's say we go to a web page, and the web page might contain some sensitive information. In this case this page is about depression treatment. Besides the first-party content, the actual article on this page, there's a bunch of third-party widgets and content on this page. And one in particular here is an advertisement. In order for your browser to render this ad, it sends a request to the ad server, and the request is just a bunch of plain text that looks like this. Obviously it tells your browser where to send this request, and in this case this is an ad from DoubleClick. The request also contains this thing called a referrer URL, which tells the server where the request came from. In this case it tells the server that we were looking at this page about depression treatment. It has the URL for that page. And finally there can be a bunch of cookies in the request. In this case one of the cookies has an ID in it, and this ID uniquely identifies me.
Now most people are probably okay with this set of data being sent to an ad server, because presumably this number, while it's uniquely attached to me, is not my name. It's just this random set of numbers. I'll talk later about why that's maybe not such a good assumption anymore. But for the last 15 years or so we could have assumed that this was anonymous information that was being sent to the ad server.
Now if we go back to this page and look at what else is on it, we also have this bunch of social widgets. So we have stuff from Facebook and Twitter and a new plus one button.
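As a rough sketch of the kind of plain-text request described above, the snippet below assembles a request containing the three pieces mentioned: the destination host, the referrer URL (sent in the HTTP `Referer` header), and a cookie carrying an ID. The host name, path, page URL, and cookie value are made up for illustration; they are not the values shown on the slide.

```python
def build_request(host, path, referer, cookies):
    """Return the plain-text HTTP/1.1 GET request a browser might send."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",             # where the request is sent
        f"Referer: {referer}",       # the page the user was looking at
        f"Cookie: {cookie_header}",  # includes the unique ID
        "",                          # blank line ends the headers
        "",
    ]
    return "\r\n".join(lines)


# Hypothetical values, chosen only to mirror the example in the talk.
ad_request = build_request(
    host="adserver.example",
    path="/ad",
    referer="https://health.example/depression-treatment",
    cookies={"id": "1234567890"},
)
print(ad_request)
```

The same function would describe a social widget request equally well; only the host and the meaning of the cookie ID on the receiving side differ.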
And if we look at the request that gets sent out in this case, it's going to look really similar. Here we're looking at the request for the Facebook widget. It's going to facebook.com. We get that identical referrer URL. And finally, again, we have a cookie with a unique ID in it. Now this looks almost identical to the request that we just looked at, but there's a huge difference here. The difference is that this ID is no longer just a string of numbers; it actually points to my Facebook profile. So it's not just my browsing history with a set of numbers that Facebook is getting. Facebook is actually getting my name. They know that I, Brian Kennish, am actually looking at that page. And not only are they getting that information, they're getting all the other information that I've explicitly given them, like my age and where I live and who my friends are.
So you'd think with all this browsing history attached to our names that these companies would at least say what they're doing with the data. At the time I looked up what they were doing, and all I found was these 404 pages. There was nothing. Facebook didn't say what they were doing with the data, nor did Google or Twitter.
So I explained this whole scenario to this Wall Street Journal reporter, and he said, well, that's kind of interesting, but how big of a problem is this really? I mean, can you quantify how much of our browsing history they're really getting? And I said, hmm, that's a really good question. Good luck finding out the answer. But he was a good reporter, so he kept asking me over and over again, and finally I relented and figured I would answer the question the way a Googler would answer it: by writing a web crawler to figure out the prevalence of all these different tracking companies. So our goals with this crawler were to get a list of the most popular sites on the web and then to go to each of those and crawl them to a link depth of one.
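To make "link depth" concrete, here is a minimal sketch of a depth-limited, breadth-first crawl. It is not the crawler from the talk: the `crawl` function, the `get_links` callback, and the toy link graph are all assumptions, and the graph stands in for real page fetching so the example stays self-contained.

```python
from collections import deque


def crawl(start_url, get_links, max_depth):
    """Return {url: depth} for every page within max_depth links of start_url."""
    seen = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if depth == max_depth:
            continue  # do not follow links from pages at the depth limit
        for link in get_links(url):
            if link not in seen:
                seen[link] = depth + 1
                queue.append(link)
    return seen


# Hypothetical toy site: each key is a page, each value the links on it.
TOY_SITE = {
    "home": ["a", "b"],
    "a": ["c"],
    "b": ["home"],
    "c": ["d"],
    "d": [],
}


def toy_links(url):
    return TOY_SITE.get(url, [])


depth_one = crawl("home", toy_links, max_depth=1)
depth_three = crawl("home", toy_links, max_depth=3)
print(sorted(depth_one))
print(sorted(depth_three))
```

With a depth of one, only the home page and the pages it links to are visited; raising the limit to three reaches the pages two and three links away as well.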
So the way a search engine crawler normally works is they'll crawl at least to a link depth of three, which means that they go to the home page and get all the links on that page (one), then go to all those pages and get all the links on those pages (two), and then go to all those pages and get all the links on those pages (three). But since I no longer worked at Google and didn't have access to a million computers or unlimited bandwidth anymore, we figured for the sake of this experiment it would be good enough to get a small sample and do a link depth of one. And then, for all those pages that we got, we were going to extract the third-party domain names from all the resources on those pages that sent these HTTP requests with the referrer URLs.
So I ran this thing over the course of a week. I decided I was going to run it out of Starbucks just for fun. And after a week we ended up indexing the thousand most popular sites. We analyzed just over 200,000 pages, and on these thousand sites we identified nearly 7,000 different third parties. The output of this crawler was this big, ugly spreadsheet that looks something like this. This happens to be an annotated version of it. But I've broken the results out into more viewable chunks here.
So the first set of stats that we're going to look at here are the non-social services: things like advertising, analytics and content services. These are the services that we could presume are anonymous. The first thing I want to point out is how prevalent they are. The top service here appeared on 23% of the top thousand websites. Essentially they're seeing 23% of our browsing history. So think about opening your web browser, going to your browsing history, randomly picking 23% of the pages in there and then sending them to, in this case, googleapis.com. That's basically what we're already doing. The next thing I want to point out is how much Google stuff there is.
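The extraction and prevalence steps described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code behind the published numbers: the input layout, the function names, and the sample hosts are hypothetical, and the first-party check is a naive "same host or subdomain" test, where a careful implementation would compare registrable domains instead.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url):
    """Lower-cased host name of a URL, or an empty string if there is none."""
    return (urlsplit(url).hostname or "").lower()


def is_third_party(site_host, resource_host):
    """Naive check: anything that is not the site host or a subdomain of it."""
    same = resource_host == site_host
    return not (same or resource_host.endswith("." + site_host))


def prevalence(crawl_data):
    """crawl_data maps site host -> {page URL: [resource URLs]}.

    Returns {third-party host: fraction of crawled sites it appeared on}.
    """
    sites_seen_on = defaultdict(set)
    for site_host, pages in crawl_data.items():
        for resources in pages.values():
            for resource_url in resources:
                host = host_of(resource_url)
                if host and is_third_party(site_host, host):
                    sites_seen_on[host].add(site_host)
    total = len(crawl_data)
    return {host: len(sites) / total for host, sites in sites_seen_on.items()}


# Hypothetical crawl output for four sites.
SAMPLE = {
    "news.example": {
        "https://news.example/": [
            "https://static.news.example/app.js",
            "https://tracker.example/pixel.gif",
        ],
        "https://news.example/story": [
            "https://tracker.example/pixel.gif",
            "https://widgets.example/like.js",
        ],
    },
    "shop.example": {"https://shop.example/": ["https://tracker.example/pixel.gif"]},
    "blog.example": {"https://blog.example/": ["https://widgets.example/like.js"]},
    "wiki.example": {"https://wiki.example/": ["https://wiki.example/style.css"]},
}

stats = prevalence(SAMPLE)
for host, share in sorted(stats.items()):
    print(f"{host}: {share:.0%} of sites")
```

Counting each third party once per site, rather than once per page, is what makes the result read as "appeared on N% of the top sites".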
So the top five services are all from Google. The way we did this analysis is that we broke out each service separately. But other researchers have looked at this as an aggregate. So for example they found that, out of the top hundred sites, some Google service appears on 97 different sites. It's pretty amazing just how prevalent Google stuff is.
And the last thing I want to point out gets at that anonymity issue, which is that most of the services on this list are part of big data companies that also have personal information. Google obviously has personal information. We log into things like Gmail and Docs and so forth. Adobe has personal information. They have Photoshop online. Amazon, obviously: you go and buy books there. Just under the top ten on this list is Atlas, which was purchased by Microsoft, who obviously has personal information. So at any point these big data companies could decide to link up their anonymous data sets with their personal data sets. What that would mean is that not only is your browsing history going forward being tracked, but all the past 15-plus years of your browsing history could instantly be attached to your name. And this isn't some hypothetical scenario. It's actually something that came up at Google a couple of years ago. The Wall Street Journal published some leaked documents where Google was debating linking up their personal and anonymous data. So it's something that could actually be happening already or certainly could happen in the future.
So this is advertising, analytics, content, everything that's not social networking, everything that doesn't have your name. This next set of stats is the social services, everything that does have your name. And you can see that Facebook is hugely prevalent. They're on a third of the top thousand websites. The really amazing thing about this number is that at the time we did this analysis, the Facebook like button had just turned one year old.
So in one year they went from zero percent to 33 percent. Likewise, when we did this analysis, Google, which was on a quarter of all the top thousand websites, didn't have the plus one button yet. So these stats are really just directional. They're probably going to increase hugely over the next year or so. And the stat I was probably the most surprised about was Twitter, whose social widgets were younger than Facebook's and were already on a fifth of the top thousand websites. So these guys are getting a huge chunk of our browsing history with our names.
So in summary, we identified 350 different services that get at least one percent of our browsing history. We identified 33 that get at least five percent and 16 that get at least ten percent. Now this data ended up getting published in this Wall Street Journal article. I'll provide a link at the end. Some longer-tail data got published in a CNN article. But like I said, this data was directional. We wanted people to be able to see what was happening on an ongoing basis.
So we've created this tool that we're putting out today, and I bet you haven't seen anyone type into their slides yet at DEF CON, which I'm going to do here. This is a little address bar. So sorry, that address was disconnect.me slash db, as in database. And we're trying to accomplish two things with this tool. First, all those sets of stats that I just went over quickly, we're throwing into this tool. So we have a set of automated stats about all the top websites, and we have a list of them in here. You can drill down and look at specific data on any site. I'm going to look up Yahoo here. And I'll just zoom in here. So you can see Yahoo has 75 different unique third parties on their site.
When you go to an average Yahoo page, there are almost five different third parties on the page, which means that not only are you sending your browsing history obviously to yahoo.com, which is where you are, but your browsing history is going to five other places.
The second thing that we wanted to address with this tool is the problem that I mentioned earlier, which is that while we can see where our data is going, we can see that it's going to Facebook or Google, they don't really do a good job of telling us what they're doing with that data. So we've teamed up with Mozilla to work on this icon project, where we can turn every website's privacy policy into a set of privacy icons that make it easy to identify what they're doing once they actually get our data. If we go back and look at this Yahoo page here, you can see we have these five icons that represent whether Yahoo is selling our data, whether they give it to advertisers, how readily they turn it over to authorities, and how long they keep that data. And this is actually a crowdsourced project. We have a wiki-based platform here. You can go read the privacy policy of any of the sites that we have in here and then set their icons according to what they're doing. We already have a JSON API, so we're hoping to make this widely available to other tools beyond our own.
So if you're interested in learning more about this stuff, I'll give you a few quick URLs here. That Wall Street Journal article is at jump, that's j.mp, slash ttt, as in tracking the trackers, wsj. The CNN article that I pointed to is at j.mp slash ttt cnn. And that database tool that I just quickly demoed is at disconnect.me slash db.
Now I understand there's a Q&A room where I get to answer questions. So, I have a bunch of VC money to wreck the advertising industry. The best question you can ask me is if we're hiring, and I'm happy to answer that one. So thank you very much.