Alright, mic looks up. Hi everybody, I'm Scott Wolchok, and this talk is Crawling BitTorrent DHTs for Fun and Profit. I'm pretty sure I didn't post my password on the slides.

Alright, so I'll start with the bad news. BitTorrent sites have this tendency to get sued, right? Like, last year we had the Pirate Bay four lose a court case that they're still appealing. We had Mininova get forced to take all the copyrighted files off of their website. Going back a couple of years, we had OiNK get shut down, one of the biggest private BitTorrent sites. Going back even further, we had Suprnova get shut down in '04, and then coming back to the present, we also had isoHunt having some legal trouble in the US. So, not great news.

Another piece of bad news that's been publicized recently is the possibility of large-scale BitTorrent surveillance. There was this paper at LEET earlier this year called Spying the World from Your Laptop, by some folks from France. In this paper they talked about how, by crawling the Pirate Bay, Mininova, and a couple of the other major trackers, you can track literally tens of millions of BitTorrent users using just one machine. We knew BitTorrent wasn't anonymous, but it's kind of striking that you can do all the spying from one machine, as long as you crawl the centralized BitTorrent search sites and so on.

So in response to this, especially their legal trouble, the Pirate Bay said in a blog post called World's Most Resilient Tracking that they were switching to, or at least promoting, DHT-based BitTorrent tracking. They said a couple of the benefits of this were that there's no central tracker that can be down and you don't have to rely on a single server. If you want to check it out, it's blog post number 175 on the site. But there's a problem with their switch to DHT tracking, at least from their perspective, I think, and it's that the DHT, just like all the centralized torrent sites, can be crawled. We can download the whole thing.
And the fact that the DHT is crawlable means that there's both good news and bad news for pirates. The bad news for all the pirates in the crowd is that the content owners can still track you through the DHT, and it's even harder to stop them, because you don't have a centralized point where you can do things like IP blocking, rate limiting, and so on. The good news for pirates is that even if all the content industries got really together legally and managed to sue all the torrent sites into oblivion overnight, so you wake up and there are no backups, as long as you found out about it quickly enough you could have a pretty big torrent site back up within a couple of hours. These two facts are going to have some consequences for attacks on BitTorrent that I'm going to revisit toward the end of the talk. Alright, so that's the basic idea. Now I'm going to move forward and look at the details for the rest of the talk.

Alright, so first, the background. I know a lot of people aren't familiar with this, so what the heck is a BitTorrent DHT? Think back to the last time you downloaded a torrent. You probably went to a website that looks something like this, the Pirate Bay, or maybe one of the other BitTorrent search engines. You typed in whatever you were looking for and you ended up at a page that looks like this. You cleverly avoided the really huge download button, which in my opinion is like the world's most misleading ad. It's just an ad; it doesn't download your torrent. And you clicked the "download this torrent" link right under the huge download button. Your BitTorrent client fired up, connected to some people around the world, and downloaded your files. So the question I'm going to talk about is: how does your BitTorrent client find those peers? The old-school way of doing this, from the original BitTorrent design, is to use a BitTorrent tracker. The trackers are a bunch of servers that are listed in the .torrent file.
Let's say that's your laptop in the lower left there, 1.2.3.4:6881. Your laptop sends an announce request to the tracker for the info hash of the torrent, which I've represented as x1234ab and so on. For purposes of this talk, the info hash is just this magic random-looking string that uniquely identifies the torrent. The tracker says, OK, I know that 1.2.3.4:6881 is downloading this torrent. Great, it makes a note of that, and it sends you back a list of other peers that have also announced themselves as downloading this torrent. Then your BitTorrent client takes that list, connects to them, does the BitTorrent magic, and you get your files.

But there's a problem with the centralized method of tracking, and it's that trackers tend to go down. And by go down, I mean mostly that they tend to get sued, right? If I remember correctly, one of the Pirate Bay's big legal problems was that they were running a tracker, hence the blog post. This is like a canonical distributed systems failure, where you have one server that goes down and your system stops working. So what you really want is something more reliable, and what the BitTorrent developers have been pushing forward for the past couple of years is tracking with distributed hash tables, which I'll abbreviate for the rest of the talk as DHTs.

So what is a DHT? Well, there's actually years of academic research on this that I'm going to condense into about two minutes. In short, it's a peer-to-peer network, hence the "distributed", that stores key-value pairs, hence the "hash table". So it's the big hash table in the sky, or in the cloud if you want to use the, you know, recent term. All right, so how does this work? You have your laptop again, 1.2.3.4:6881. You can store a value under a key. For example, you might store the value 1 under the key "one", and you can look up a value by key.
So you might do a get for the key "two", and if all is right with the world, you get the value 2. All right, so it's just a big distributed system that acts almost like a hash table, like a dict or a hash in Perl or Python or your other favorite programming language.

We need a bit more detail for this talk. This is a peer-to-peer network, so we have to have some way of figuring out which peers are going to store the data. The way it works is that both peers and items of data get 160-bit, or 20-byte, IDs. For purposes of this talk, again, the peer ID is picked magically at random and is just guaranteed to be probably unique. The data ID is picked by taking the SHA-1 hash of the key. So SHA-1 gets you a random-ish number that is probably unique for this key. And then what you do is you have the peers store data that has an ID close to theirs, numerically speaking. So if I'm peer 10, I might store data items 11, 12, and 9, but probably not data item 200.

So how do you do DHT tracking? Well, you just throw away the tracker and you replace it with this big distributed system. Again, you have your laptop trying to get peers. The first thing you do is look up the info hash of the torrent as a key in the table. Remember, this is what you sent to the tracker before to tell it which torrent you wanted. What comes back as the value in the DHT is a list of the peers that are also downloading the torrent. Then, to make sure you show up when somebody else does this, you do a store for your IP address and port under that same key, the info hash of the torrent. So we do the same thing that we did with the tracker, but we replace the vulnerable single machine with this distributed system that's much harder to take down. And by the way, the DHT is formed by all the BitTorrent clients that have DHT support, so your BitTorrent client just joins this network when you start it up. Okay, so we've got DHT tracking.
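As a rough illustration of the ID scheme and the lookup-then-store pattern just described, here is a minimal Python sketch. It is a toy in-memory model, not the Vuze or mainline protocol: the class `ToyDHT` and the names `data_id`, `announce`, and `get_peers` are hypothetical, every "peer" lives in one process, and closeness is plain numeric difference as in the description above (Kademlia-style DHTs use XOR distance instead).

```python
import hashlib
import random


def data_id(key: bytes) -> int:
    """160-bit ID for a key: its SHA-1 digest read as a big integer."""
    return int.from_bytes(hashlib.sha1(key).digest(), "big")


class ToyDHT:
    """Toy model: every peer has a random 160-bit ID and a local dict."""

    def __init__(self, num_peers: int, seed: int = 0):
        rng = random.Random(seed)
        self.peers = {rng.getrandbits(160): {} for _ in range(num_peers)}

    def responsible_peer(self, key: bytes) -> int:
        """ID of the peer numerically closest to the key's data ID."""
        target = data_id(key)
        return min(self.peers, key=lambda pid: abs(pid - target))

    def announce(self, info_hash: bytes, addr: str) -> None:
        """Store our 'ip:port' under the torrent's info hash."""
        store = self.peers[self.responsible_peer(info_hash)]
        store.setdefault(info_hash, set()).add(addr)

    def get_peers(self, info_hash: bytes) -> set:
        """Look up the info hash and return the announced peers."""
        store = self.peers[self.responsible_peer(info_hash)]
        return set(store.get(info_hash, set()))


dht = ToyDHT(num_peers=100)
info_hash = bytes.fromhex("1234ab" + "00" * 17)  # made-up 20-byte info hash
dht.announce(info_hash, "1.2.3.4:6881")
dht.announce(info_hash, "5.6.7.8:6881")
print(sorted(dht.get_peers(info_hash)))
```

A real client would send these requests over the network to other clients and would store each value on several nearby nodes rather than one; the sketch only shows how the info hash picks where the peer list lives.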
The other piece of the puzzle that the Pirate Bay added in that blog post is magnet links. The magnet link looks like this, up at the top of the slide there. It starts with "magnet:". The important part of the magnet link is what I have in bold there. It's an info hash. Remember, this is the identifier that you were sending to the tracker and to the DHT. So when your BitTorrent client sees one of these, it can go directly to the DHT and get a list of peers right away. No .torrent required. It's not entirely clear to me why they started using this. It's smaller than using a .torrent file, but I think that they thought it improved their legal position. It's not clear to me that it does, but I'm not a lawyer, so what do I know? Okay, so we have DHT tracking and magnet links.

Now, the talk's called crawling DHTs. So how would you build a DHT crawler? What we did was we dug through the Vuze BitTorrent client. It's also called Azureus; you might be more familiar with that name. We looked through all their DHT implementation stuff, and we re-implemented the protocol in C for efficiency. It was kind of a huge piece of Java code, and we wanted something a bit faster, so we had to redo it in C.

Once we had this efficient DHT client, we used it to perform what's called a Sybil attack on the DHT. A Sybil attack is this generic attack on distributed systems. It's named after a character from a novel or a play or something with multiple personalities. What you do is you give your machine multiple personalities in your distributed system. So, for example, you simulate thousands of clients in whatever system at once. For example, we put thousands of DHT clients on this netbook here, let's say. Once we've done that, we've got thousands of randomly spaced, or pretty much randomly spaced, IDs in the DHT.
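To make the effect of those extra IDs concrete, here is a small Python simulation. It is only a sketch under simplifying assumptions: all IDs are uniformly random, each value is stored on the single numerically closest node (real DHTs typically replicate a value on several nearby nodes, which this ignores), and the function name `sybil_capture_rate` and its parameters are hypothetical.

```python
import random


def sybil_capture_rate(honest: int, sybils: int,
                       trials: int = 2000, seed: int = 1) -> float:
    """Estimate the fraction of random keys whose closest node is a Sybil."""
    rng = random.Random(seed)
    space = 1 << 160  # size of the 160-bit ID space
    honest_ids = [rng.randrange(space) for _ in range(honest)]
    sybil_ids = [rng.randrange(space) for _ in range(sybils)]

    captured = 0
    for _ in range(trials):
        key = rng.randrange(space)  # stand-in for the SHA-1 of some stored key
        best_honest = min(abs(node - key) for node in honest_ids)
        best_sybil = min(abs(node - key) for node in sybil_ids)
        if best_sybil < best_honest:
            captured += 1
    return captured / trials


# More simulated identities means more of the stored values land on us.
for count in (100, 1000, 4000):
    print(count, sybil_capture_rate(honest=1000, sybils=count))
```

In this simplified model the captured share grows roughly as sybils / (honest + sybils), which is why running thousands of identities on one machine is enough to see a large part of the table.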
And so when someone stores something, it's likely that the value is going to land on one of our simulated clients, which means it's likely to land on the single physical machine. So when we're doing the Sybil attack, we can just sit and wait for values to come in. We use a couple of other tricks, which I talk about in the paper referenced at the bottom of the slide. But the result is that you can cheaply capture almost all of the Vuze DHT. Again, if you want to read up more about this, the paper's called Defeating Vanish with Low-Cost Sybil Attacks Against Large DHTs, and there's a link to it in the links at the end of the slides.

All right. The one other thing I have to tell you is that there are actually two BitTorrent DHTs, as if this wasn't complicated enough. There's the mainline DHT and the Vuze DHT. The Vuze DHT is just for the Vuze client, and the mainline DHT is for everybody else. It turns out the Vuze one is a lot nicer, but it's also complicated, and so lots of people just haven't bothered to do separate implementations yet. In this talk, I've only implemented this stuff on Vuze, because that was what we had code lying around for already. But you could extend all of this to mainline, too.

Okay, so we've got a DHT crawler. What can we do with it? Well, the first thing we can do is build a BitTorrent search engine. So what do we want to do? We're going to build a BitTorrent search engine from scratch. Unlike that LEET paper I mentioned earlier in the talk, we're not going to look at any of the existing torrent sites, and we're not going to look at any existing trackers. Instead, what we're going to be able to do is recover fast, even if all of the existing torrent sites and search engines were sued into oblivion overnight, no backups. So this gives us a way to recover even if everybody gets sued. And then what's our site going to look like?
Well, we're going to provide keyword search, just like every other BitTorrent search site, and you're going to get a magnet link to your torrent.

So how do you design a search engine? Well, crawl, index, search is the general design. Think about Google, right? They've got their crawler, Googlebot, that goes out and downloads web pages. They do their indexing with some massive MapReduce job or something else, who knows. They won't really tell anybody. And then the search part is whatever the back end is for the Google.com web page; it handles your search query. The other cool thing about Google is that they had a really nice ranking metric in PageRank, which let you measure the goodness or popularity of a page based on incoming links.

So how does our search engine work? Well, again, for the crawler, we just use the DHT crawler that I just talked about. The indexing step is just a Python script that imports the results of the DHT crawl into Postgres. And the search is Postgres's keyword search. For ranking, we actually have it a lot easier than Google, because we're getting those lists of peers that normal clients would have gotten from the DHT. So if you want to know how popular a torrent is, you just count the size of that peer list, and that's how many people are downloading it, which is probably a pretty good guess at how popular it is.

But there's a problem. If you go read that paper, you build a crawler for the Vuze DHT, and you actually download the whole thing, you're going to find out that what you got are pairs of the SHA-1 hash of the info hash of a torrent and a list of peers. The problem is you can't build a torrent site out of that, because you need to give people the info hash, and the whole point of SHA-1 is that it's supposed to be one-way. It's hard to reverse. So you can't calculate the info hash of the torrent from what you get there.
So the way we worked around this is we found this other record type in the Vuze DHT called a torrent description. The Vuze client actually stores these in the DHT to support its own prototype search feature, which, as far as I know, is not exposed in the interface yet. The torrent description looks like that last point on the page. This is bencoding, if any of you are familiar with the details of BitTorrent. Again, the important parts are in bold. There's the title of the torrent, which is Fast and Furious here, and then there is the info hash, a big hex string in bold. So if we take this torrent description, we can connect the title and the info hash to the list of peers just by calculating the SHA-1 of the info hash ourselves and matching it up with our database using a join. That's how we get over the SHA-1 problem, and that's what lets us build the search engine.

The indexing is just a Python script that does a pass over the logs and does a bunch of INSERT statements into Postgres. Nothing too exciting there. The search, likewise, is just a big SQL query, and you could build the website in PHP or Django or whatever web framework you'd like. So you put this all together and you can build what, as I'll show you in a second, is a big search engine pretty fast.

Now, the other side, I guess the dark side or the light side of this, depending on which side you're on, is that you can also spy on BitTorrent users, if you want to do things like sue them, right? So how does this work? Well, you do the same crawl that you did for search. You just repeat it over time. So you run your crawler many times over the days, weeks, years, whatever. You run the same importer and you get two database tables. You get the peer lists that you would have gotten from the DHT, and then you get a table full of torrent descriptions.
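The matching step described above, hashing the info hash from each torrent description and joining it against the crawled peer lists, might look roughly like this in Python. This is a sketch, not the code from the talk: it uses the standard library's `sqlite3` in place of Postgres, the table and function names (`peers`, `descriptions`, `build_index`, `search`) are hypothetical, the sample records are invented, and it assumes the DHT key is the SHA-1 of the raw 20-byte info hash.

```python
import hashlib
import sqlite3


def sha1_hex(info_hash_hex: str) -> str:
    """SHA-1 of the raw info hash bytes (assumed to be how the key is derived)."""
    return hashlib.sha1(bytes.fromhex(info_hash_hex)).hexdigest()


def build_index(peer_records, descriptions):
    """peer_records: (hashed_key, 'ip:port'); descriptions: (info_hash, title)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE peers (hashed_key TEXT, addr TEXT);
        CREATE TABLE descriptions (hashed_key TEXT, info_hash TEXT, title TEXT);
    """)
    conn.executemany("INSERT INTO peers VALUES (?, ?)", peer_records)
    # Hash each known info hash ourselves so it can be joined to the crawl data.
    conn.executemany(
        "INSERT INTO descriptions VALUES (?, ?, ?)",
        [(sha1_hex(ih), ih, title) for ih, title in descriptions],
    )
    return conn


def search(conn, keyword):
    """Keyword search ranked by the number of distinct peers seen."""
    rows = conn.execute(
        """
        SELECT d.title, d.info_hash, COUNT(DISTINCT p.addr) AS peer_count
        FROM descriptions AS d
        JOIN peers AS p ON p.hashed_key = d.hashed_key
        WHERE d.title LIKE ?
        GROUP BY d.title, d.info_hash
        ORDER BY peer_count DESC
        """,
        (f"%{keyword}%",),
    ).fetchall()
    return [(title, "magnet:?xt=urn:btih:" + ih, n) for title, ih, n in rows]


# Invented sample data: two torrents, three peers.
movie, show = "aa" * 20, "bb" * 20
crawl = [
    (sha1_hex(movie), "192.0.2.1:6881"),
    (sha1_hex(movie), "192.0.2.2:6881"),
    (sha1_hex(show), "192.0.2.3:6881"),
]
conn = build_index(crawl, [(movie, "Example Movie"), (show, "Example Show")])
for result in search(conn, "example"):
    print(result)
```

The join never inverts SHA-1; it only works for torrents whose info hash is already known from a torrent description, which is the limitation the workaround accepts.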
So if you want to take an IP and see what files they downloaded, so that you can tell the right people to sue them, or take files and map them to IPs, you just do a join on these two tables, which is a really basic database operation where you take two tables and see where they're the same. Again, you just do it by matching on the SHA-1 that you got from the DHT. And that gives you a content-to-IP mapping.

So, for the rest of the talk: how well does this work? What we did was we ran our crawler for two weeks and two days. We did three crawls per day, one every eight hours. And we set the crawl to simulate 8,000 nodes in the DHT for between an hour and two hours. Our model for how well the crawler works predicts that this should show you about 20% of the contents of the DHT. So this isn't a full crawl, but you'll see in a second that we can do it really fast.

So how big a search engine do you get? Well, our average crawl created a search engine that would have had about a million torrents in it. For comparison, this is a pretty good fraction of what the Pirate Bay has. They'd indexed about 2.8 million torrents last time I checked. So it's not a complete replacement. People aren't going to build torrent sites like this if they don't have to. But it's a pretty good fraction, and it turns out we find that the most popular torrents are a lot more popular than the rest, so torrents don't have a long-tail distribution. As long as we get most of the most popular torrents, we're actually doing really well.

How long does it take? The crawler runs for about an hour and 20 minutes, so 80 minutes, and you have to wait about 20 minutes for the import into the database and the indexing to happen. That's a total of 100 minutes, which is less than two hours, so you can build your search engine pretty damn fast. You can go even faster if you've got more bandwidth, so if you want to pay Amazon, that's pretty easy to do. And if you want to wait longer, you can build a bigger search engine.
How well does monitoring work? Over that 16-day period we tracked 8 million IP addresses downloading 1.5 million torrents. So a pretty sizable number, especially for just two weeks. Now, to give you a flavor of the kind of spying we're able to do, first we looked up the top seven torrents. There have been a ton of studies that did this, but this one is direct from the DHT, unlike that other study, where there was some accusation that their top torrents were fake. I attempted to verify, and most of these were not fake. If you look at this list, what you notice is that they're all popular recent TV and movies, and they're probably not legit, right? It seems unlikely that these are legally distributed. It looks like somebody's probably infringing copyright, at least in the US, by distributing this. And by the way, that's the second-to-last episode of Lost, not the last one. Unfortunately, we had to stop right before the last episode of Lost came out, which would have been really cool, but that's life.

So it looks like the top torrents are all pirated, but what about the rest? Maybe the rest of it's just Linux ISOs? So we dug through the list. I looked at the file names and the titles for the top thousand torrents, and there was nothing in there that was obviously legit, except for one torrent. There was one legit thing in the top thousand, and it was a subscription to a search for porn. So as far as we can tell, it doesn't look like y'all are mostly downloading Linux ISOs on this particular DHT. You're going to have to do another study if you want something that makes it look like it's legit. Sorry, I couldn't really conclude that.

The other thing we can do is track the popularity of stuff over time. So this is the penultimate episode of Lost. It aired on Tuesday, May 18th. It becomes available on BitTorrent in the crawl immediately after it airs. You guys are fast. But it's not very popular until Friday rolls around. So you can make two arguments with this.
You could try to say that, well, people are just using BitTorrent to time-shift their viewing of Lost because they missed it on Tuesday. The other argument you can make, which is probably equally plausible, since this doesn't really say either way, is that pirates are lazy and they don't have time to torrent stuff until Friday. So this can go either way. Looks like I've got a little bit of... So that gives you a flavor of the monitoring we can do. If I've got time, I'll show you a couple more charts. But let me start wrapping it up.

So with this DHT crawler, if you crawl the Vuze DHT, you can do two things. You can create big BitTorrent search engines fast, and you can spy on 8 million people in two weeks. Now, one thing we can conclude from that is that there's not really a whole lot of point in all these lawsuits against torrent sites, right? It's not really "cut off the head and the body will die." It's more like, even if you cut off all the heads, we'll just grow new ones in two hours. So what's really the point? The flip side of this is that even moving to the DHT is not going to magically turn BitTorrent into some anonymous protocol, right? Even if there is only DHT tracking, it's not going to help people hide, because we can spy on people using just the DHT.

So what we expect to see in the future is DHT poisoning, meaning attacks that make it harder to use the DHT to find peers and harder to do this kind of search engine building. And if you want to continue trying to stop BitTorrent by suing people, the only people really left to sue are the users and the releasers, which I imagine is a big portion of the audience. So look out for lawsuits.

So here's a couple of links. The top one there is just the basic Vuze page about the DHT. The second one there is really important. It's on the slides, on the CD. It's going to have updated slides.
It's got a link to the paper that goes with this talk, which has more pretty charts, and the client for the Vuze DHT. And then those are the two papers I referenced. The top one there is Spying the World from Your Laptop. The bottom one is our paper, which talks about how to build a Vuze DHT crawler.

I think I've got one minute, so I'll show you one more pretty chart. This is torrent popularity on a log-log scale, so the scale is not linear. The little inset there has a linear scale. The cool thing about this is that the most popular torrents are really popular, but popularity falls off super fast. So if you catch, like, the top 17% of torrents, you've got almost all of what's interesting. Right, man, I think I'm out of time, so I will take questions in the Q&A room. Thank you.