Hello, can you hear me? Yes. OK, so my name is Gareth Owen. I'm from the University of Portsmouth. I'm an academic and I'm going to talk to you about an experiment that we did on the Tor hidden services, trying to categorise them, estimate how many there were, etc. But as we go through the talk, I'm going to explain how Tor hidden services work internally and how the data was collected, so you know what sort of conclusions you can draw from the data based on the way that we've collected it. Just so I get an idea, how many of you use Tor on a regular basis? Could you put your hand up for me? So quite a big number of us. Keep your hand up, or put it up, if you're a relay operator. Wow, that's quite a significant number, isn't it? And then keep your hand up, or put it up, if you run a hidden service. OK, so a smaller number, but still some people that run hidden services. OK, so some of you may be very familiar with the way Tor works at the low levels. I am going to go through it for those who aren't, so they understand how hidden services work. And as we go along, because I'm explaining how the hidden services work, I'm going to tag on information about how the Tor hidden services themselves can be de-anonymised, and also how the users of those hidden services can be de-anonymised, if you put some strict criteria on what it is you want to do with respect to that. So the things that I'm going to go over: I'm going to cover how Tor works, and then specifically how hidden services work. I'm going to talk about something called the Tor distributed hash table for hidden services. If you've heard that term and don't know what it means, don't worry, I will explain what a distributed hash table is and how it works. It's not as complicated as it sounds. And then I'm going to go over our darknet data, so the data that we collected from Tor hidden services.
And as I say, as we go along, we'll explain how you do de-anonymisation of both the services themselves and of the visitors to the services, and just how complicated it is. So you may have seen this slide, which I think was from GCHQ, released last year as part of the Snowden leaks, where they said: you can de-anonymise some users some of the time, but they've had no success in de-anonymising someone in response to a specific request. So given all of you, for example, I may be able to de-anonymise a small fraction of you, but I can't choose precisely one person that I want to de-anonymise. That's what I'm going to be explaining in relation to the de-anonymisation attacks: how you can de-anonymise a section of the users, but you can't necessarily choose which section you'd be de-anonymising. So Tor tries to address a couple of different problems. On one part, it allows you to bypass censorship. So if you're in a country like China, which blocks some types of traffic, you can use Tor to bypass those censorship blocks. It tries to give you privacy, so at some level in the network, someone can't see what you're doing, and at another point in the network people don't know who you are, but may be able to see what you're doing. Now, the traditional case for this is to look at VPNs. With a VPN, you have a single provider. You have lots of users connected into the VPN. The VPN has a mixing effect from an outside observer's point of view. And then out of the VPN, you see requests to Twitter, Wikipedia, et cetera, et cetera. And if that traffic isn't encrypted, then the VPN can also read the contents of the traffic. Now, of course, there's a fundamental weakness with this: you have to trust the VPN provider. The VPN provider knows both who you are and what you're doing, and can link those two together with absolute certainty.
So whilst you do get some of these properties, assuming you've got a trustworthy VPN provider, you don't get them in the face of an untrustworthy VPN provider. And of course, how do you trust a VPN provider? What sort of measure do you use? That's an open question. So Tor tries to solve this problem by distributing the trust. Tor is an open source project, so you can go onto their git repository; you can download the source code, change it, improve it, submit patches, et cetera. As you heard earlier during Jacob and Roger's talk, they're currently partly sponsored by the US government, which seems a bit paradoxical, but they explained in that talk that it doesn't affect their judgment, and indeed they do have some funding from other sources. And they designed the system, which I'll talk about a little bit later, in a way where they don't have to trust each other. So there's some redundancy in there, trying to minimise these trust issues. Tor is a partially decentralized network, which means that it has some centralized components which are under the control of the Tor project, and some decentralized components which are normally the Tor relays. So if you run a relay, you're one of those decentralized components. There is, however, no single authority on the Tor network, so no single server which you are required to trust. So the trust is somewhat distributed, but not entirely. Now, when you establish a circuit through Tor, you, the user, download a list of all of the relays inside the Tor network, and you get to pick, and I'll tell you how you do that, which relays you're going to use to route your traffic through. So here's a typical example. You're here on the left-hand side as the user.
You download a list of the relays inside the Tor network, and you select from that list three nodes: a guard node, which is your entry into the Tor network; a relay node, which is a middle node, essentially it's going to route your traffic to a third hop; and then the third hop is the exit node, where your traffic essentially exits out onto the internet. Now, looking at the circuit, so this is a circuit through the Tor network through which you're going to route your traffic, there are three layers of encryption at the beginning. So between you and the guard node, your traffic is encrypted three times. In the first instance, it's encrypted to the guard, and then it's encrypted again to the relay, and then encrypted again to the exit, and as the traffic moves through the Tor network, each of those layers of encryption is peeled off the data. Now, the guard here in this case knows who you are, and the exit relay knows what you're doing, but neither knows both, and the middle relay doesn't really know a lot, except for which relay is your guard and which relay is your exit. Who runs an exit relay? So if you run an exit relay, all of the traffic which users are sending out onto the internet appears to come from your IP address. So running an exit relay is potentially risky, because someone may do something through your relay which attracts attention, and then when law enforcement trace that back to an IP address, it's going to come back to your address. So some relay operators have had trouble with this, with law enforcement coming to them and saying, hey, we've got this traffic coming from your IP address, and you have to go and explain it. So if you run an exit relay, it's a little bit risky, but we're thankful for those people that do run exit relays, because ultimately, if people didn't run exit relays, you wouldn't be able to get out of the Tor network, and it wouldn't be terribly useful from this point of view. So yes.
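The layered encryption just described can be sketched in a few lines of Python. This is a toy illustration only: the three hop keys are hard-coded and the "cipher" is a SHA-256 keystream XOR, whereas real Tor negotiates a separate AES key with each hop as the circuit is built.

```python
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    # Toy keystream: SHA-256 in counter mode. Illustration only, NOT secure.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def xor_layer(data: bytes, key: bytes) -> bytes:
    # XOR with the keystream; applying the same key twice removes the layer
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

# Hypothetical per-hop keys (real Tor negotiates these during circuit setup)
hop_keys = [b"guard-key", b"middle-key", b"exit-key"]

payload = b"GET / HTTP/1.1"

# The client wraps the payload once per hop, exit layer innermost
cell = payload
for key in reversed(hop_keys):
    cell = xor_layer(cell, key)

# Each relay peels exactly one layer as the cell moves along the circuit
for key in hop_keys:
    cell = xor_layer(cell, key)

assert cell == payload  # fully unpeeled at the exit
```

The point of the sketch is that no single hop ever holds more than one of the three keys, so no single hop can both see the payload and know who sent it.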
So every Tor relay, when you set it up, publishes something called a descriptor, which describes your Tor relay and how to use it, to a set of servers called the authorities. And the trust in the Tor network is essentially split across these authorities. They're run by the core Tor project members, and they maintain a list of all of the relays in the network, and they observe them over a period of time. If the relays exhibit certain properties, they give the relays flags. If, for example, a relay allows traffic to exit from the Tor network, it'll get the exit flag. If they've been switched on for a certain period of time and offer a certain amount of bandwidth, they'll be allowed to become a guard relay, which is the first node in your circuit. So when you build your circuit, you download a list of these descriptors from one of the directory authorities. You look at the flags which have been assigned to each of the relays, and then you pick your route based on that. So you'll pick the guard node from the set of relays which have the guard flag, and your exit from the set of relays which have the exit flag, and so on and so forth. Now, as I had a quick count this morning, there are about 1,500 guard relays, around 1,000 exit relays, and six relays flagged as bad exits. What does a bad exit mean? It's not good. That's exactly what it means. Yes. So relays which have been flagged as bad exits, your client will never choose to exit traffic through. And examples of things which may get a relay flagged as a bad exit are if they're fiddling with the traffic which is coming out of the Tor relay, or doing things like man-in-the-middle attacks against SSL traffic. And we've seen various things: there have been relays man-in-the-middling SSL traffic, and there has very recently been an exit relay which was patching binaries that you download from the internet, inserting malware into the binaries.
So you can do these things, but the Tor project tries to scan for them, and if these things are detected, then the relays will be flagged as bad exits. Now, it's true to say that the scanning mechanism is not 100% foolproof by any stretch of the imagination. It tries to pick up common types of attacks, so as a result, it won't pick up unknown attacks, or attacks which haven't been seen or known about beforehand. So looking at this, how do you de-anonymise the traffic travelling through the Tor network? So given some traffic coming out of the exit relay, how do you know which user that corresponds to? What is their IP address? Well, you can't actually modify the traffic, because if any of the relays try to modify the traffic which they're sending through the network, Tor will tear down the circuit through the relay. So there are these integrity checks at each of the hops, and because you can't decrypt the packet, you can't modify it in any meaningful way; and because there's an integrity check at the next hop, any modification is detected. So you can't insert a sort of marker and try and follow the marker through the network. So instead, what can you do? Let me give you two cases. In the worst case, if the attacker controls all three of the relays that you pick, which is an unlikely scenario as they'd need to control quite a big proportion of the network, then it should be quite obvious that they can work out who you are and also see what you're doing, because in that case they can tag the traffic and they can just discard these integrity checks at each of the following hops. Now in a different case, if you control the guard relay and the exit relay, but not the middle relay, the guard relay can't tamper with the traffic, because the middle relay will close down the circuit as soon as that happens.
The exit relay can't send stuff back down the circuit to try and identify the user either, because again the circuit will be closed down. So what can you do? Well, you can count the number of packets going through the guard node, and you can measure the timing differences between packets, and try and spot that pattern at the exit relay. So you're looking at counts of packets and the timing between those packets which are being sent, and essentially trying to correlate them. So if a user happens to pick your guard node and then happens to pick your exit relay, then you can de-anonymise them with very high probability using this technique. You're just correlating the timings of packets and counting the number of packets going through. And the attacks demonstrated in the literature are very reliable at this. We heard earlier from the Tor talk about the relay early attack, which was the attack discovered by the CERT researchers in the US. That attack didn't rely on timing attacks. Instead, what they were able to do was send a special type of cell containing the data back down the circuit, essentially marking this data and saying, this is the data we're seeing at the exit relay or at the hidden service, and encode into the messages travelling back down the circuit what the data was. And then you could pick those up at the guard relay and say, well, it's this person that's doing that. In fact, although this technique works, and yeah, it was a very nice attack, the traffic correlation attacks are actually just as powerful. So although this bug has been fixed, traffic correlation attacks still work and they're still fairly reliable. So the problem still does exist. This is very much an open question: how do we solve this problem? We don't know currently how to solve the problem of traffic correlation. There are a couple of solutions, but they're not particularly reliable.
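The core idea of the correlation attack can be sketched with made-up timestamps: compare the inter-packet gaps observed at the guard with the gaps observed at the exit. The same flow shows almost identical gaps even after a constant latency shift, while an unrelated flow does not. The similarity measure below is deliberately crude; the published attacks use far more sophisticated statistics.

```python
# Hypothetical packet timestamps (seconds), invented for illustration
def gaps(timestamps):
    # Inter-packet gaps: the timing "fingerprint" of a flow
    return [b - a for a, b in zip(timestamps, timestamps[1:])]

def mismatch(guard_times, exit_times):
    # Mean absolute difference between gap patterns; smaller = better match
    g, e = gaps(guard_times), gaps(exit_times)
    n = min(len(g), len(e))
    return sum(abs(a - b) for a, b in zip(g[:n], e[:n])) / n

guard      = [0.00, 0.10, 0.35, 0.40, 0.90]          # seen at the guard
exit_same  = [t + 0.05 for t in guard]               # same flow, shifted by latency
exit_other = [0.00, 0.50, 0.55, 1.20, 1.25]          # unrelated flow at the exit

# The latency shift cancels out of the gaps, so the true flow matches best
assert mismatch(guard, exit_same) < mismatch(guard, exit_other)
```

Note that a constant network delay drops out entirely when you look at gaps rather than absolute times, which is why simply adding latency at one hop doesn't defeat the attack.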
Let me just go through these, and I'll skip back on the few things I've missed. The first thing is high-latency networks: networks where packets are delayed in their transit through the network, which throws away a lot of the timing information. So they promise to potentially solve this problem. But of course, if you want to visit Google's homepage and you have to wait five minutes for it, you're simply just not going to use Tor. The whole point is trying to make this technology usable, and if you've got something which is very, very slow, then it's not attractive to use. But this case does work slightly better for email. If you think about it, with email you don't mind, well, you may not mind, you may mind, if your email is delayed by some period of time, which makes these attacks somewhat more difficult. And as Roger said earlier, you can also introduce padding into the circuit. So these are dummy cells, but with a big caveat: some of the research suggests that actually you'd need to introduce quite a lot of padding to defeat these attacks, and that would overload the Tor network in its current state. So again, not a particularly practical solution. So how does Tor try to solve this problem? Well, Tor makes it very difficult to become a user's guard relay. If you can't become a user's guard relay, then you don't know who the user is, quite simply. And so by making it very hard to become the guard relay, you can't do this traffic correlation attack. So at the moment the Tor client chooses one guard relay and keeps it for a period of time. So if I want to target just one of you, I would need to control the guard relay that you were using at that particular point in time, and in fact, I'd also need to know what that guard relay is. So by making it very unlikely that you would select a particular malicious guard relay, where the number of malicious guard relays is very small, that's how Tor tries to solve this problem.
And at the moment your guard relay is your barrier of security. If the attacker can't control the guard relay, then they won't know who you are. That doesn't mean they can't try other sorts of side-channel attacks by messing with your traffic at the exit relay and so on; for example, you may download dodgy documents and open them on your computer, those sorts of things. Now the alternative, of course, to having a guard relay and keeping it for a very long time would be to have a guard relay and change it on a regular basis, because you might think, well, just choosing one guard relay and sticking with it is probably a bad idea. Well, actually, that's not the case. If you pick the guard relay, and assuming the chance of picking a guard relay that is malicious is very low, then when you first choose your guard relay, if you've made a good choice, your traffic is safe; if you haven't made a good choice, your traffic isn't safe. Whereas if your Tor client chooses a guard relay every few minutes, or every hour, or something along those lines, at some point you're going to pick a malicious guard relay. So they're going to have some of your traffic, but not all of it. And so currently the trade-off is that we make it very difficult for an attacker to control a guard relay, and the user picks a guard relay and keeps it for a long period of time. And so it's very difficult for the attacker to be that guard relay when they control a very small proportion of the network. So this currently provides those properties I described earlier, the privacy and the anonymity, when you're browsing the web, when you're accessing websites and so on. But still you know who the website is. So although you're anonymous and the website doesn't know who you are, you know who the website is. And there may be some cases where, for example, the website would also wish to remain anonymous.
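The reasoning behind keeping one guard can be made concrete with a little arithmetic. The malicious fraction p below is an illustrative assumption, not a measured figure: a client who keeps one guard is compromised with probability p, while a client who rotates through n guards is eventually compromised with probability 1 - (1 - p)^n.

```python
# Assumed malicious fraction of guard capacity -- illustrative only
p = 0.01

# Keep a single guard: you are unsafe only if that one choice was bad
fixed_guard_risk = p

# Rotate guards, e.g. daily for a year: one bad pick is enough to expose you
rotations = 365
rotating_risk = 1 - (1 - p) ** rotations

assert fixed_guard_risk == 0.01
assert rotating_risk > 0.97  # near-certain compromise within the year
```

So even a tiny per-choice risk compounds into near-certain exposure under frequent rotation, which is why the client sticks with one guard for a long period.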
You want the person accessing the website and the website itself to be anonymous to each other. And you could think about people in countries where running a political blog, for example, might be a dangerous activity. If you run that on a regular web server, you're easily identified, whereas if you've got some way where you as the web server can be anonymous, then that allows you to do that activity without being targeted by your government. So this is what hidden services try to solve. Now, when you first think about the problem you kind of think, hang on a second: the user doesn't know who the website is, and the website doesn't know who the user is, so how on earth do they talk to each other? Well, that's essentially what the Tor hidden service protocol tries to set up: how do you identify and connect to each other? So at the moment this is what happens. We've got Bob on one side, who is the hidden service, and we've got Alice on the other side, who is the user who wishes to visit the hidden service. Now, when Bob sets up his hidden service, he picks three nodes in the Tor network as introduction points and builds multi-hop circuits to them. So the introduction points don't know who Bob is. But Bob has circuits to them, and Bob says to each of these introduction points: will you relay traffic to me if someone connects to you asking for me? And these introduction points do that. So then once Bob has picked his introduction points, he publishes a descriptor describing the list of his introduction points for anyone who wishes to connect to his website. And then Alice, wishing to visit Bob, will pick a rendezvous point in the network and build a circuit to it. So this RP here is the rendezvous point, and she will relay a message via one of the introduction points saying to Bob, meet me at the rendezvous point, and then Bob will build a three-hop circuit to the rendezvous point.
So now at this stage we've got Alice with a multi-hop circuit to the rendezvous point, and Bob with a multi-hop circuit to the rendezvous point. Alice and Bob haven't connected to one another directly; the rendezvous point doesn't know who Bob is; the rendezvous point doesn't know where Alice is; all it's doing is forwarding the traffic, and it can't inspect the traffic either, because the traffic itself is encrypted. So that's currently how you solve this problem of trying to communicate with someone when you don't know who they are, and vice versa. So the principal thing I'm going to talk about today is this database. So I said Bob, when he picks his introduction points, builds this thing called a descriptor, describing who his introduction points are, and he publishes it to a database. This database itself is distributed throughout the Tor network; it's not a single server. So both Bob and Alice need to be able to publish information to this database and also retrieve information from it, and Tor currently uses something called a distributed hash table. I'm going to give an example of what this means and how it works, and then I'll talk to you specifically about how the Tor distributed hash table itself works. So let's say, for example, you've got a set of servers, so here we've got 26 servers, and you would like to store your files across these different servers without having a single server responsible for deciding, OK, well, that file is stored on that server, and this file is stored on that server, and so on and so forth. Now, here's my list of files. You could take a very naive approach and say, OK, well, I've got 26 servers, and I've got all of these file names which start with a letter of the alphabet, and I could say all the files which begin with A are going to go onto server A, all the files that begin with B are going to go on server B, and so on and so forth.
And then when you want to retrieve a file, you say, OK, well, what does my file name begin with, and then you know which server it's stored on. Now, of course, you could have a lot of files which begin with a Z, an X or a Y, et cetera, in which case you're going to overload that server; you're going to have more files stored on one server than another in your set. And if you have a lot of big files, say for example beginning with B, then rather than distributing your files across all of the servers, you're going to be overloading just one or two of them. So to solve this problem, what we tend to do is take the file name and run it through a cryptographic hash function. A hash function produces output which looks random. Very small changes in the input to a cryptographic hash function produce a very large change in the output, and this change looks random. So if I take all of my file names here, and assuming I had a lot more, I take a hash of them, and then I use that hash to determine which server to store each file on. Then with high probability my files will be distributed evenly across all of the servers. And then when I want to go and retrieve one of the files, I take my file name, I run it through the cryptographic hash function, that gives me the hash, and then I use that hash to identify which server that particular file is stored on, and then I go and retrieve it. So that's the loose idea of how a distributed hash table works. There are a couple of problems with this. You know, what if the number of servers you've got changes in size, as it does in the Tor network? So that's a very brief overview of the theory. So how does it apply to the Tor network? Well, the Tor network has a set of relays, and it has a set of hidden services.
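That hash-based placement rule can be sketched as follows. The 26 server names and the file names are invented for the example, and a real DHT would also have to cope with servers joining and leaving:

```python
import hashlib

servers = [chr(c) for c in range(ord("A"), ord("Z") + 1)]  # 26 toy servers

def server_for(filename: str) -> str:
    # Hash the file name and reduce the digest to a server index, so
    # placement depends on the random-looking hash, not the first letter
    digest = hashlib.sha256(filename.encode()).digest()
    return servers[int.from_bytes(digest, "big") % len(servers)]

# Invented file names: with hashing they spread out roughly evenly
files = [f"file{i}.txt" for i in range(1000)]
counts = {}
for name in files:
    counts[server_for(name)] = counts.get(server_for(name), 0) + 1

assert sum(counts.values()) == len(files)
assert max(counts.values()) < 3 * len(files) / len(servers)  # no hot spot
```

Retrieval uses the exact same function: anyone who knows the file name can recompute `server_for(name)` and go straight to the right server, with no central index needed.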
Now, we take all of the relays, each of which has a hash identity which identifies it, and we map them onto a circle using that hash value as an identifier. So you can imagine the hash value ranging from zero to a very large number. We've got a zero point at the very top there, and that runs all the way around to the very large number. So given the identity hash for a relay, we can map that to a particular point on the circle. And now all we have to do is also do this for hidden services. So there's a hidden service address, something.onion, so this is one of the hidden websites that you might visit. I'm not going to describe in too much detail how this is done, but the value is computed in such a way that it's evenly distributed around the circle. So your hidden service will have a particular point on the circle, and the relays will also be mapped onto this circle. So there are the relays and the hidden service. And in the case of Tor, the hidden service actually maps to two positions on the circle, and it publishes its descriptor to the three relays to the right at one position and the three relays to the right at the other position. So there are in total six places where this descriptor is published on this circle. And now if I want to go and connect to a hidden service, I want to pull this hidden service descriptor down to identify what its introduction points are. I take the hidden service address, I find out where it is on the circle, I map all of the relays onto the circle, and then I identify which relays on the circle are responsible for that particular hidden service, and I just connect to them and say: do you have a copy of the descriptor for that particular hidden service? And if so, then we've got our list of introduction points and we can go through the next steps to connect to our hidden service. So I'm going to explain how we set up our experiment.
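A minimal sketch of the circle logic, with invented relay names and a simplified position function. Real Tor derives positions from SHA-1 identity digests and a descriptor ID that mixes in the onion address and the date, but the "walk clockwise and take the next three relays" step is the same idea:

```python
import hashlib

def ring_position(identity: str) -> int:
    # Map an identity onto the circle [0, 2**32) via a hash
    digest = hashlib.sha256(identity.encode()).digest()
    return int.from_bytes(digest[:4], "big")

# Invented relay names for the example
relays = [f"relay{i}" for i in range(50)]
ring = sorted(relays, key=ring_position)

def responsible_relays(service_pos: int, count: int = 3):
    # Walk clockwise: the first `count` relays at or after the service's position
    for i, relay in enumerate(ring):
        if ring_position(relay) >= service_pos:
            return [ring[(i + j) % len(ring)] for j in range(count)]
    return ring[:count]  # wrapped past the top of the circle

# A service publishes to three relays at each of its two positions;
# a client recomputes the same positions to find where to fetch from
pos = ring_position("exampleexample.onion")
resp = responsible_relays(pos)
```

Because publisher and fetcher run the identical computation over the same relay list, they arrive at the same six relays without ever coordinating directly.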
So what we thought was really interesting to do was to collect publications of hidden services. So every time a hidden service gets set up, it publishes to this distributed hash table; what we wanted to do was to collect those publications so that we could get a complete list of all of the hidden services, and what we also wanted to do was find out how many times a particular hidden service is requested. Just one more point that will become important later: the position at which a hidden service appears on the circle changes every 24 hours. So it's not a fixed position every single day. So if we run 40 nodes over a long period of time, we will occupy positions within that distributed hash table, and we'll be able to collect publications and requests for hidden services that are located at those positions inside the distributed hash table. So in that case we ran 40 Tor nodes. We had a student at the university who said, hey, you know what, I run a hosting company, I've got loads of server capacity. We told him what we were doing, and he said, well, you've really helped us out these last couple of years, and just gave us loads of server capacity to allow us to do this. So we spun up 40 Tor nodes. Each Tor node is required to advertise a certain amount of bandwidth to become a part of that distributed hash table. It's actually a very small amount, so it doesn't matter too much, and you then have to be up for a certain period of time. This has changed in the last few days; it used to be 25 hours, and it's just been increased as a result of one of the attacks last week, but certainly during our study it was 25 hours. You'll then appear at a particular point inside that distributed hash table, and you're then in a position to record publications of hidden services and requests for hidden services. So not only can you get a full list of the onion addresses, you can also find out how many times each of the onion addresses is requested.
And so this is what we recorded. And then once we'd run for a long period of time, to collect a long list of onion addresses, we built a custom crawler that would visit each of the Tor hidden services in turn and pull down the HTML content, so the text content from the web page, so that we could go ahead and classify the content. Now, it's really important to note here, and it will become obvious why a little bit later, we only pulled down HTML content. We didn't pull down images, and there's a very, very important reason for that, which will become clear shortly. So we had a lot of questions when we first started this. No one really knew how many hidden services there were. It had been suggested to us that there was a very high turnover of hidden services; we wanted to confirm whether that was true or not. And we also wanted to ask: what are the hidden services? How popular are they? Et cetera, et cetera. So here is our estimate for how many hidden services there were over the period of our study; this is our graph plotting our estimate, for each individual day, of how many hidden services there were on that particular day. Now, the data is naturally noisy, because we're only a very small proportion of that circle. So we're only observing a very small proportion of the total publications and requests every single day for each of those hidden services. Now, if you take the long-term average of this, there are about 45,000 hidden services that we think were present on average each day during our entire study, which is a large number of hidden services. But over the entire length we collected about 80,000 in total. Some came and went, et cetera. So the next question after how many hidden services there are: how long does a hidden service exist for? Does it exist for a very long period of time? Does it exist for a very short period of time? Et cetera, et cetera.
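The scaling-up step behind a daily estimate like this is simple: divide what the nodes observed by the fraction of the circle they occupied. All of the numbers below are hypothetical, chosen only to show the arithmetic, not taken from the study:

```python
# Hypothetical figures, purely to illustrate the scaling-up arithmetic
observed_services = 600   # distinct services our nodes saw publish in one day
our_nodes = 40
total_positions = 3000    # notional number of directory positions on the circle

# Our nodes see roughly our_nodes / total_positions of all publications,
# so scale the observation up by the inverse of that fraction
estimated_total = observed_services * total_positions / our_nodes
assert estimated_total == 45000.0
```

Because each day's sample covers only a small slice of the circle, the day-to-day estimates bounce around, which is exactly the noise visible in the graph.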
So what we did was, for every single onion address, we plotted how many times we saw a publication for that particular hidden service during the six months. How many times did we see it? If we saw it a lot of times, that suggested in general the hidden service existed for a very long period of time; if we saw a small number of publications for each hidden service, that suggests they were only present for a very short period of time. So this is our graph. By far the largest number of hidden services we only saw once during the entire study, and we never saw them again, which suggests that there's a very high turnover of hidden services. They don't tend to exist, on average that is, for a very long period of time. And then you can see the sort of tail here. Now, if we plot just those hidden services which existed for a long time, so for example we could take hidden services which have a high number of hit requests and say, OK, well, those that have a high number of hits probably existed for a long time, that's not absolutely certain, but probably, then you see a sort of normal distribution centred about four or five. So we saw most of these hidden services on average four or five times during the entire six months, if they were popular. If they were popular, and we're using that as a proxy measure for whether they existed for the entire time. Now, this data is over 160 days, so almost six months. What we also wanted to do was try and confirm this over a longer period. So last year, in 2013, about February time, some researchers at the University of Luxembourg also ran a similar study, but they ran it over a very short period of time, over a day. But they did it in such a way that they could collect descriptors across much of the circle during a single day. That was because of a bug in the way Tor did some things, which has now been fixed, so it can't be repeated in that particular way.
So we've got a list of onion addresses from February 2013 from these researchers at the University of Luxembourg, and then we've got our list of onion addresses from this six months, which was March to September of this year, and we wanted to say: OK, given these two sets of onion addresses, which onion addresses existed in their set but not ours, and vice versa, and which onion addresses existed in both sets? So as you can see, a very small minority of hidden service addresses existed in both sets. So over this 18-month period between these two collection points, a very small number of services existed in both their data set and in our data set, which again suggests there's a very high turnover of hidden services; they don't tend to exist for a very long period of time. So the question is, why is that? Which we'll come on to a little bit later; it's a very valid question, we can't answer it 100%, but we'll have some inklings as to why that may be the case. So in terms of popularity, which hidden services did we see, or which onion addresses did we see, requested the most? Which got the largest number of hits, or the largest number of directory requests? Botnet command-and-control servers. So if you're not familiar with what a botnet is, the idea is that you infect lots of people with a piece of malware, and this malware phones home to a command-and-control server, where the botnet master can give instructions to each of the bots to do things. It might be, for example, to collect passwords, keystrokes, banking details, or it might be to do things like distributed denial-of-service attacks, or to send spam, those sorts of things. Now, a couple of years ago someone gave a talk and said, well, the problem with running a botnet is your command-and-control servers are vulnerable. Once your command-and-control servers are taken down, you no longer have control of your botnet.
So there's been this arms race between antivirus companies and malware authors, with the authors trying to come up with techniques to run command and control servers in a way that means they can't be taken down. A couple of years ago someone gave a talk at a conference saying, you know what, it would be a really good idea if botnet command and control servers were run as Tor hidden services, because then no one knows where they are and in theory they can't be taken down. And in fact that's what we see: there are loads and loads of these addresses associated with several different botnets, Sefnit and Skynet among them. Now, Skynet is the one I want to talk about, because the guy that ran Skynet had a Twitter account and also did a Reddit AMA. If you've not heard of a Reddit AMA before, that's "ask me anything": you can go on the website and ask the guy anything. So this guy wasn't hiding in the shadows; he was saying, hey, I'm running this massive botnet, here's my Twitter account which I update regularly, here's my Reddit AMA where you can ask me questions, and so on. He was arrested last year, which is perhaps not a huge surprise. But although he was arrested and his command and control servers have disappeared, there are still infected hosts trying to connect to those command and control servers, and that's why we're seeing a large number of hits. All of these requests are failed requests: we didn't have a descriptor for them because the hidden service had gone away, but there were still clients requesting each of the hidden services. Now, the next thing we wanted to do was to try to categorise sites. As I said earlier, we crawled all of the hidden services we could and classified them into different categories based on the type of content on each site. The first graph shows the number of sites in each category.
So you can see along the bottom we've got lots of different categories: drugs, marketplaces, et cetera, and the graph shows the percentage of the hidden services we crawled that fit into each category. For example, the largest number of sites we crawled were drugs-focused websites, followed by marketplaces, and so on. There are a couple of categories you might have questions about. What does porn mean? Well, you know what porn means, but there were some very notorious porn sites on the Tor darknet. There was one in particular focused on revenge porn. It turns out that youngsters take pictures of themselves and send them to their boyfriends or girlfriends, and when they get dumped, those pictures get published on these websites. There were several of these sites on the main internet, which have mostly been shut down, and some of them were archived on the darknet. The second category you're probably wondering about is abuse. Every single site we classified in this category was a child abuse site; they were in some way facilitating child abuse. Now, how do we know that? Well, the data that came back from the crawler made it completely unambiguous, completely obvious, what the content on these sites was. And this is the principal reason why we didn't pull down images from sites: in many countries that would be a criminal offence. So our crawler only pulled down text content from all of these sites, and that enabled us to classify them on that basis. We didn't pull down any images.
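A text-only classifier of the kind described could, in its simplest form, look something like this (the category names and keyword lists here are illustrative assumptions, not the study's actual classification rules):

```python
# Illustrative keyword lists; the real classification was done over the
# text content pulled down by the crawler.
CATEGORIES = {
    "drugs": ["cannabis", "mdma", "pills"],
    "marketplace": ["escrow", "vendor", "listing"],
}

def classify(page_text):
    """Return the first category whose keywords appear in the page text."""
    text = page_text.lower()
    for category, keywords in CATEGORIES.items():
        if any(word in text for word in keywords):
            return category
    return "other"

print(classify("Trusted vendor, full escrow protection"))  # marketplace
```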
So, of course, the next thing you'd like to do is to see, okay, given each of these categories, what proportion of directory requests went to each? Now, the next graph needs some explaining as to precisely what it means, and I'm going to give that. This is the proportion of the directory requests we saw that went to each category of hidden service we classified. As you can see, we in fact saw a very large number going to these abuse sites, with the rest distributed down at the bottom. Now, the question is, what is it we're collecting here? We're collecting successful hidden service directory requests. What does a hidden service directory request mean? Well, it probably loosely correlates with either a visit or a visitor; it's somewhere in between the two, because when you want to visit a hidden service you make a request for the hidden service descriptor, and that allows you to connect to it and browse the website. But there are cases where, for example, if you restart Tor, you'll go back and refetch the descriptor, so in that case we'd count you twice. Now, what proportion of these are people and what proportion are something else? The answer is we simply don't know. We've got directory requests; that doesn't tell us what the requesters are doing on these sites, what they're fetching, or who or what they are. These could be automated requests or they could be human beings; we can't distinguish between the two. So what are the limitations? Well, a hidden service directory request correlates exactly with neither a visit nor a visitor; it's probably somewhere in between, so we can't say it's exactly one or the other. And we cannot say whether a hidden service directory request comes from a person or something automated; we can't distinguish between those two.
Any type of site could be targeted by, for example, denial of service attacks or web crawlers, which would greatly inflate the figures. If you were doing a denial of service attack, it's likely you'd only request a small number of descriptors: you'd be flooding the site itself rather than the directories. In theory you could flood the directories, but we didn't see any sort of shutdown of our directories from flooding, for example. So whilst we can't rule that out, it doesn't seem to fit too well with what we've got. The other question is crawlers. I've obviously talked with the Tor project about these results, and they've suggested that there are groups, for example child protection agencies, that crawl these sites on a regular basis. Again, that doesn't necessarily correlate with a human being, and it could inflate the figures. How many hidden service directory requests would a crawler generate? If they're crawling on a single day, typically one request, but if they've got a large number of servers doing the crawling, then it could be a request per day for every single server. So, again, I can't give you a definitive answer as to whether this is human beings or automated requests. The other important point is that these two content graphs cover only hidden services offering web content. There are hidden services that do other things, for example IRC, instant messaging, and so on; those aren't included in these figures. We're only looking at hidden services offering websites, so HTTP or HTTPS services, because they allow us to classify them easily. And in fact the results for some of the other types, IRC and Jabber, are probably not directly comparable with websites; the use case for them is probably slightly different. So, I appreciate the last graph is somewhat alarming.
If you have any questions, please ask either me or the Tor developers about how to interpret these results; it's not as straightforward as it may look. You might look at the graph and say, hey, that looks like lots of people visiting these sites, but it's difficult to conclude that from the results. So, the next slide is going to be very contentious, and I'll prefix it by saying I'm not advocating any kind of action whatsoever; I'm just trying to describe, technically, what could be done. It's not up to me to make decisions on these types of things. So, of course, when we found this out, frankly, I think we were stunned. It took us several days of just being stunned: what the hell, this is not what we expected at all. So a natural step, since most of us think Tor is a great thing, is to ask: could this problem be sorted out while still keeping Tor as it is? Could we block just this class of content and not other types? So, could we block just the hidden services associated with these sites and not other hidden services? Well, in fact, there are three ways in which hidden services could be blocked. I'll talk in a moment about whether these will remain possible in the coming months, but during our study they would have been possible, and presently they are possible. First, a single individual could shut down a single hidden service by controlling all of the relays responsible for receiving publication requests for it on the distributed hash table. It's possible to place one of your relays at a particular position on that circle and therefore make yourself the responsible relay for a particular hidden service.
And if you control all six relays responsible for a hidden service, then when someone comes to you and says, can I have the descriptor for that site, you can just say, no, I haven't got it. Provided you control those relays, users won't be able to fetch those sites. The second option: if the Tor project aren't blocking these, which I'll talk about in a second, could I as a relay operator say, I don't want to carry this type of content or be responsible for serving it up? Well, a relay operator could patch his relay so that if anyone comes to it requesting any one of these sites, it simply refuses. The problem is that a very large percentage of relay operators would need to do this to block a site effectively. The final option is that the Tor project could modify the Tor program and embed these addresses in the program itself, so that all relays by default block hidden service directory requests to these sites, and clients themselves also refuse requests for them at the client level. Now, I hasten to add, I'm not advocating any kind of action; that is entirely up to other people, because frankly, if I advocated blocking hidden services, I probably wouldn't make it out of here alive. I'm just describing what technical measures could be used to block some classes of site. Of course, there are lots of questions here. If, for example, the Tor project themselves decided to block these sites, that means they're essentially in control of a block list.
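The relay-operator patch described above amounts to a simple membership check before serving a descriptor; a minimal sketch, assuming a blocklist of descriptor IDs (the IDs and the store here are fabricated examples, not real Tor code):

```python
# Hypothetical descriptor store held by one hidden service directory relay.
descriptor_store = {"desc-id-1": "<descriptor blob>"}

# Descriptor IDs a patched relay refuses to serve (fabricated example IDs).
blocklist = {"desc-id-2"}

def handle_fetch(descriptor_id):
    """Serve a descriptor, but pretend not to have blocked ones."""
    if descriptor_id in blocklist:
        return None  # "no, I haven't got it"
    return descriptor_store.get(descriptor_id)

print(handle_fetch("desc-id-1"))  # <descriptor blob>
print(handle_fetch("desc-id-2"))  # None
```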
The block list would be somewhat public, so everyone would be able to inspect which sites were being blocked, but the Tor project would still be in control of some kind of block list, which arguably is against what they're after. So, how about de-anonymising visitors to hidden service websites? In this case, we've got a user on the left-hand side, connected to a guard node, and a hidden service on the right-hand side, also connected to a guard node. At the top, we've got one of those directory servers responsible for serving up hidden service directory requests. Now, when you first want to connect to a hidden service, you connect through your guard node and a couple of further hops up to the hidden service directory, and you request the descriptor from it. At this point, if you are the attacker and you control one of the hidden service directory nodes for a particular site, you can send a particular pattern of traffic back down the circuit, and if you also control that user's guard node, which is a big if, you can spot that pattern of traffic at the guard node. The question is, how do you control a particular user's guard node? Well, that's very, very hard. But if, for example, I run a hidden service and all of you visit it, and I'm running a couple of dodgy guard relays, then the probability is that some of you, certainly not all of you by any stretch, will have selected my dodgy guard relay, and I could de-anonymise those users but not the rest. So what we're saying is that you can de-anonymise some of the users some of the time, but you can't pick which users you're going to de-anonymise. You can't de-anonymise someone specific, but you can de-anonymise a fraction, based on what fraction of the network's guard capacity you control.
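The "some users, some of the time" point can be made quantitative with a toy model: if each guard selection were independent, the chance of eventually landing on an attacker-controlled guard would grow with repeated selections. Real Tor clients pin a guard for months, so this independence assumption overstates the per-user risk; the function below is only a sketch of the intuition, not Tor's actual guard behaviour:

```python
def p_hit_malicious_guard(malicious_fraction, selections):
    """Chance that at least one of `selections` independent guard picks
    lands on an attacker-controlled guard, given the attacker holds
    `malicious_fraction` of guard capacity. Toy model only: real
    clients keep the same guard for months, so picks are not
    independent per visit."""
    return 1 - (1 - malicious_fraction) ** selections

# One guard pick with 1% malicious capacity: a 1% chance of exposure.
print(round(p_hit_malicious_guard(0.01, 1), 2))    # 0.01
# Many independent picks drive the chance up sharply.
print(round(p_hit_malicious_guard(0.01, 100), 2))  # 0.63
```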
So, with the attacker controlling those two positions: here's a picture from a researcher at the University of Luxembourg who did this. These are plots made by taking the IP addresses of users visiting a command and control server and geolocating them on a map: where was the user located when they contacted one of these Tor hidden services? Again, this captures only a selection, a percentage, of the users visiting the command and control servers, using this technique. How about de-anonymising hidden services themselves? Well, again, you've got the same problem. The user connects through their guard into the Tor network, and then eventually through the hidden service's guard node to the hidden service. As the attacker, you need to control the hidden service's guard node to do these traffic correlation attacks. So again, it's very difficult to de-anonymise a specific Tor hidden service, but if you think about it: if there are a thousand Tor hidden services and you control a percentage of the guard capacity, then some hidden services will pick your relays as guards, and you'll be able to de-anonymise those. Provided you don't care which hidden services you de-anonymise, it becomes much more straightforward; you can control the guard nodes of some hidden services, but you can't pick which ones. So, what sort of data can you see traversing a relay? This is a modified Tor client which just dumps cells, essentially packets travelling down a circuit, and the information you can extract from them at a guard node; this was done off the main Tor network. I've got a client connected to a, quote, malicious guard relay, and it logs every single packet, or cell as they're called in the Tor protocol, coming through the guard relay. We can't decrypt the cell contents, because they're encrypted three times.
What we can record, though, is the IP address of the user, the IP address of the next hop, a count of the cells travelling in each direction down the circuit, and the time at which each cell was sent. So of course, if you're doing traffic correlation attacks, you use that timing information to work out whether you're seeing traffic which you sent and which identifies a particular user, or indeed traffic which they sent and which you've observed at a different point in the network. So, moving on to, blimey, interesting problems, research questions, et cetera. Now, I've said there are these directory authorities, which are controlled by core Tor members. If a big enough chunk of them were malicious, they could manipulate the consensus to direct you to particular nodes. I don't think that's the case, I don't think anyone thinks that's the case, and Tor is designed so that you'd have to control a certain number of the authorities to be able to do anything important. I said to the Tor people a couple of days ago, I find it quite funny that you design your system as if you don't trust each other. To which their response was: no, we design our system so that we don't have to trust each other, which I think is a very good model to have for this type of system. So, could we eliminate these sorts of centralised servers? I think that's actually a very hard problem. There are lots of attacks which could potentially be deployed against a decentralised network. At the moment, the Tor network is relatively well understood in terms of which types of attack it's vulnerable to; if we were to move to a new architecture, we might open ourselves up to a whole new class of attacks. The Tor network has existed for quite some time and has been very well studied.
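The kind of record a logging guard could keep, as described above, might look like this (a sketch only, not real Tor code; the field names are assumptions, and the IP addresses are documentation examples):

```python
import time
from dataclasses import dataclass

@dataclass
class CellRecord:
    """Metadata visible at a guard relay for one Tor cell. The payload
    is encrypted three times, so only routing and timing information
    can be logged."""
    timestamp: float
    prev_ip: str    # the user's IP address
    next_ip: str    # the next hop in the circuit
    outbound: bool  # direction of travel along the circuit

log = []

def on_cell(prev_ip, next_ip, outbound):
    """Append one observation; timing correlation works off these records."""
    log.append(CellRecord(time.time(), prev_ip, next_ip, outbound))

on_cell("198.51.100.7", "203.0.113.9", True)
print(log[0].prev_ip)  # 198.51.100.7
```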
What about global adversaries like the NSA, who can monitor network links all across the world? It's very difficult to defend against that: if they can identify which guard relay you're using, they can monitor traffic going into and out of the guard relay and then along each of the subsequent hops. Do we know if they're doing it? The documents that were released yesterday, and I've only had a very brief look through them, suggest that they're not presently doing it and haven't had much success. I don't know why. There are very powerful attacks described in the academic literature which are very, very reliable, and most academic literature can be accessed for free, so it's not even as if they have to figure out how to do it; they just have to read the literature and implement some of these attacks. But I don't know why they're not. The next question is how to detect malicious relays. In our case, we were running 40 relays. Most of them were on consecutive IP addresses in two blocks, so they were on IP addresses numbered, for example, 1, 2, 3, 4, with two relays per IP address, and every single relay had my name plastered across it. So after I set up these 40 relays in a relatively short period of time, I expected someone from the Tor project to come to me and say, hey, Gareth, what are you doing? No one noticed. So this is presently an open question, and the Tor project are quite open about it. They acknowledged that last year the CERT researchers launched far more relays than that; the Tor project spotted that large number of relays but chose not to do anything about it, and in fact those relays were deploying an attack. It's often very difficult to defend against unknown attacks.
So at the moment, how to detect malicious relays is a bit of an open question, which I think is being discussed on the mailing list. The other one is detecting unknown tampering at exits. An exit relay can tamper with traffic. We know about particular types of attack, SSL man-in-the-middle and so on, and we've recently seen binary patching; but how do we detect unknown tampering with other types of traffic? The binary tampering wasn't detected by the Tor project themselves; it was spotted by someone else and reported to them. And the final one I've put on here is Tor code review. The Tor code is open source, but we know from OpenSSL that although everyone can read source code, people don't always look at it, and OpenSSL has been a huge mess, with lots of vulnerabilities disclosed recently. There are lots of eyes on the Tor code, but more eyes are always better, right? So if we can get people to look at the Tor code and look for vulnerabilities, I encourage them to do that; it's a very useful thing to do. There could be unknown vulnerabilities, as we've seen with the relay-early attack quite recently in the Tor code, which could be quite serious, but the truth is we just don't know until people do thorough code audits, and even then it's very difficult to know for certain. So my last point, I think, yes, is advice to future researchers. If you're planning on doing a study on Tor in the future, do not do what the CERT researchers did and start de-anonymising people on the live Tor network, in a way which is incredibly irresponsible. I tend to give people the benefit of the doubt; I don't think the CERT researchers set out to be malicious.
I think they were just very naive as to what they were doing, and that was rapidly pointed out to them. In our case, we were running 40 relays. Our Tor relays were forwarding traffic and acting as good relays; the only extra thing they were doing was logging publication requests to the directory. There's a big question as to whether that's malicious or not; I don't know. One thing that has been pointed out to me is that the onion addresses themselves could be considered sensitive information, so the only data we will be retaining from the study is the aggregated data. We won't be retaining information on individual onion addresses, because that could potentially be considered sensitive if you think about someone running an onion address hosting something they don't want other people knowing about. So we won't be retaining that data; we'll be destroying it. So I think that brings me nicely on to questions. I want to say thanks to a couple of people: the student who donated the server to us; Nick Savage, one of my colleagues, who was a sounding board during the entire study; Ivan Pustogarov, the researcher at the University of Luxembourg who sent us the large data set of onion addresses from last year, and who is also the chap who demonstrated those de-anonymisation attacks I talked about; and a big thank you to Roger Dingledine, who has frankly been presented with loads of questions by me over the last couple of days and has allowed me to bounce ideas back and forth. That's been a very useful process. If you are doing future research, I strongly encourage you to contact the Tor project at the earliest opportunity; I certainly found them to be extremely helpful. Donaccia also did something similar; both Ivan and Donaccia have done similar studies trying to classify the types of hidden services, or to work out how many hits particular types of hidden service get.
Ivan Pustogarov did it on a biggish scale and found similar results to ours: these abuse sites featured frequently among the top requested sites. That was done over a year ago, and he was seeing similar sorts of patterns, with these abuse sites being requested frequently, so that corroborates what we're saying. The data I put online is at this address. There will be copies of the slides, and something called the Tor research framework, which is an implementation of a Tor client in Java aimed specifically at researchers. So if, for example, you want to pull data out of the consensus, you can; if you want to build custom routes through the network, you can; if you want to build routes through the network and start sending padding traffic down them, you can; et cetera. The code is designed to be easily modifiable for testing these sorts of things. There's also a link to the FBI's Tor exploit, which they deployed against visitors to some Tor hidden services last year. They exploited a Mozilla Firefox bug and then ran code on the computers of users visiting these hidden services in order to identify them. At this address there's a link to that, including a copy of the shellcode and an analysis of exactly what it was doing. And then, of course, a list of references for papers and so on. So I'm quite happy to take questions now. So, thanks for the nice talk. Do we have any questions from the internet? Test? Yeah. One question: it's very hard to block addresses, since creating them is cheap and they can be generated for each user and rotated often. Can you think of any other way of doing the blocking? Ah, sorry. No, that's absolutely true. If you were to block a particular onion address, they can relaunch another onion address, and I don't know of any way to counter that at the moment. Another one from the internet? Okay, then microphone one please. Thank you, that's fascinating research.
You mentioned that it is possible to influence the hash of your relay node, in the sense that you could choose which hidden service you're responsible for. Is that right? Correct. So could you elaborate on how that is possible? So, for example, if you just keep regenerating the public key for your relay, you'll get closer and closer to the point where you'll be the responsible relay for that particular hidden service. You simply keep regenerating your identity hash until you're at that particular point on the ring, and that's not particularly computationally intensive to do. Thanks. Okay, that was it, yep. Okay, next question from microphone five please. Yeah, hi. I was wondering, for the attacks where you identify a certain number of users using a hidden service: have those attacks been used? Is there any evidence of that? And is there any way of protecting against it? That's a very interesting question: is there any way to detect these types of attack? For some of the attacks, if you're going to generate particular traffic patterns, one way to do that is to use padding cells. Padding cells aren't used at the moment by the official Tor client, so detecting them could be indicative, but it's not conclusive evidence of an attack. And is there any way of protecting against, I don't know, a government or someone trying to deny service to hidden services through this? Sorry, trying to deny... Is it possible to protect against this kind of attack? Not that I'm aware of. The Tor project are currently revising how they do the hidden service protocol, which will make, for example, what I did, enumerating the hidden services, much more difficult, and will also make it harder to position yourself on the distributed hash table in advance for a particular hidden service. So they are at the moment trying to change the way it's done and make some of these things more difficult. Good. Next question from microphone two please. Hi.
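The key-regeneration trick can be illustrated with stand-in keys: repeatedly generate a candidate identity, hash it, and keep whichever fingerprint lands closest after the target descriptor ID on the ring. This is a sketch only: random bytes stand in for real RSA identity keys, and the target descriptor ID is fabricated.

```python
import hashlib
import os

RING = 2 ** 160  # SHA-1 output space: the identifier circle

def fingerprint(key_bytes):
    # Relay fingerprints are the SHA-1 hash of the identity key;
    # random bytes stand in for real RSA keys in this illustration.
    return int.from_bytes(hashlib.sha1(key_bytes).digest(), "big")

def best_position(target_desc_id, attempts=10_000):
    """Keep regenerating candidate keys, retaining the one whose
    fingerprint lands closest after the target descriptor ID,
    i.e. the relay most likely to be responsible for it."""
    best_key, best_gap = None, RING
    for _ in range(attempts):
        key = os.urandom(128)
        gap = (fingerprint(key) - target_desc_id) % RING
        if gap < best_gap:
            best_key, best_gap = key, gap
    return best_key, best_gap

# Fabricated target descriptor ID, for demonstration only.
key, gap = best_position(fingerprint(b"example-descriptor-id"))
print(gap / RING)  # a tiny fraction of the ring after 10k tries
```

With a few thousand cheap hash attempts the gap shrinks roughly in proportion, which is why the speaker describes the attack as not particularly computationally intensive.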
Yeah, I run the Tor2web abuse desk, and I used to see a lot of abuse requests concerning Tor hidden services being exposed on the internet through the tor2web.org domain name. I just wanted to comment on the number of abuse requests you mentioned. I've spoken with some of the child protection agencies that reported abuse to tor2web.org, and they are effectively using crawlers that periodically look for changes in order to get new images to put in their databases. What I was able to understand is that the German agency doing this is crawling the same sites that the Italian agency is crawling, too. So it's likely that in most countries there are child protection agencies crawling the small number of hidden services reachable over Tor2web that contain child porn. And I also saw from the Tor2web statistics that the amount of abuse relating to that kind of content is relatively low. Just a contribution. Yeah, so that's very interesting, thank you for that. So, next, microphone four please. Yeah, on the attack to de-anonymise users with an infected or modified guard relay: is it required to modify the guard relay, or could I control the entry point of the user to the internet, if I'm his ISP? Yes, so, could you observe traffic travelling into a guard relay without controlling the guard relay itself? In theory, yes. I wouldn't be able to tell you off the top of my head how reliable that would be. Thanks. So, another question from the internet. Wouldn't the ability to choose a key hash prefix give the ability to target specific onions? You can only target one onion address at a time, because of the way it's generated. You wouldn't be able to, for example, pick a key which targeted two or more onion addresses; you can only target one onion address at a time by positioning yourself at a particular point in the distributed hash table. Another one from the internet? Okay, then microphone three please. Hey, thanks for this research.
I think it strengthens the network. In that vein, I was wondering whether you could donate these relays to be part of the network as non-malicious relays, basically using them as regular relays afterwards. Okay, so could I donate the relays that we ran to the Tor network's capacity? Unfortunately, they ran on hardware donated by a student for a fixed period of time, so we've given it back to him. We're very grateful to him; he was very generous, and in fact, without his donation it would have been much more difficult to collect as much data as we did. Okay, next, microphone five please. Yeah, hi. First of all, thanks for your talk. I think you've raised some real issues that need to be considered very carefully by everyone at the Tor project. My question: I'd like to go back to the issue of so many abuse-related websites running over Tor. I think it's an important issue that really needs to be considered, because anyone who uses Tor, runs a relay or runs an exit node doesn't want to be associated with that at the end of the day. And I understand it's a bit of a sensitive issue and you don't really have any say over whether it's implemented or not, but I'd like to get your opinion on the implementation of a distributed block or deny system that would run in very much a similar way to the directory authorities. I'd just like to hear what you think of that. So you're asking whether I would support a particular blocking mechanism? I'd like to get your opinion on it. I know it's a sensitive issue, but like I said, I think it needs to be considered, because people running exit nodes and relays and the people at the Tor project don't want to be associated with the massive number of abuse websites that currently exist within the Tor network. I absolutely agree, and I think the Tor project are horrified as well that this problem exists.
And they've in fact talked about it in previous years; they do have a problem with this type of content. As to what, if anything, is done about it, that's very much up to them. Could it be done in a distributed fashion? The example I gave was a way it could be done by relay operators; that would need the consensus of a large number of relay operators to be effective, so that is done in a distributed fashion. The question is, who supplies the list of onion addresses to block to each of the relay operators? Clearly, the relay operators aren't going to collect it themselves; it needs to be supplied by someone like the Tor project, for example, or someone trustworthy. So, yes, it can be done in a distributed fashion, and it can be done in an open fashion. Who knows. Yeah, thank you. Good. And another question from the internet. Apparently there's an option in Tor to collect statistics on hidden services; do you know about this, and how does it relate to your research? Yes. The extent to which I know about it is that they're going to be trying this next month, to try to estimate how many hidden services there are. So keep your eye on the Tor project website; I'm sure they'll be publishing more data in the coming months. And sadly, we are running out of time, so this will be the last question. Microphone four please. Hi. I'm just wondering if you could outline what ethical clearances you had to get from the university to conduct this kind of research. Yeah, so we have to discuss these types of things before undertaking any research, and we go through steps to make sure that we're not, for example, storing sensitive information about particular people. So, yes, we are very mindful of that, and that's why I made a particular point of putting some of the things to consider on the slides. So, you outlined a potential implementation of the traffic correlation attack.
Are you saying that you performed the attack? No, no, no, absolutely not. The link I'm giving... absolutely not. We have not engaged in any de-anonymisation research on the live Tor network. It just wasn't clear from the slides. I apologise; yes, let me be absolutely clear on that: no, we did not. The research I showed is linked in the references at the end of the slides; you can read about it, but it was done in simulation. For example, there's a program, a way to simulate the Tor network on a single computer; I can't remember the name of the project, though. Shadow, yes. There's a system called Shadow where you can run a large number of Tor relays on a single computer and simulate the traffic between them. If you're going to do that type of research, then you should use that. Okay, thank you very much, everyone. Okay, thank you.