All right, how's everybody doing? Having a good time? All right, how many people here use Tor? All right, how many people trust Tor? No, there we go. How about that? That's good to see. Well, these two gentlemen have been working on a project that is going to help us perhaps trust Tor a little bit more, or at least find those people out there who are messing around with Tor and making it untrustworthy. Let's give these two guys a big round of applause.
Thank you for the introduction. It seems that we have a problem with the slides. Apparently they're working on fixing it; in the meantime, we're going to start. I'm Guevara Noubir, and this is a joint presentation and work with Amirali Sanatinia. We are both from Northeastern University. We work on developing security and privacy techniques and building systems to enhance security. We're also interested in investigating the potential of attacks on real-world systems, hence this work.
Unfortunately you cannot see the slides, but the talk is about something we developed called honey onions, and it is about exposing snooping Tor HSDir relays. This is in the context of Tor. A large number of people use it; it's quite popular. We're interested in understanding how many of the Tor relays are misbehaving. These are the relays that host information about what are called hidden services. In fact, the issue we're going to be talking about is known to the Tor people, and they've been working on both a long-term solution and a short-term solution. Our interest was in knowing how many of these relays are misbehaving today.
The next slide, which unfortunately you cannot see, is about what Tor is. Tor is a very powerful and popular tool for enhancing privacy. I personally regularly use Tor.
Any time, for example, I want to check something that has to do with health, or any time I Google something that I don't want the whole world to know about, I use it. I feel that among the systems that exist today, it's really one of the best that one can use.
So maybe one question to the audience. The first question was how many of you know about it. So how many of you use it? Okay, a fair number. How many of you run a relay? How many of you have a hidden service running? Okay, I see far fewer, and it could mean that some of you don't even want to disclose this information. This work is about showing that, while you can maybe trust Tor about various kinds of things, you cannot trust that the existence of your hidden service stays hidden. That is the goal of this work.
So Tor provides two types of services that are quite well known. One of them is that you can browse the internet anonymously: you can go to some website, and the goal is that neither your ISP nor the website would know that it is you doing so. The other type of service is called hidden services. It allows one to run a server, a website for example, such that no one is able to know its IP address, and therefore its physical location.
Tor is used by a large number of people. I think they have over a million people every day using it. Most of these are normal users: they browse the internet and don't want to reveal things about themselves. Some people try to circumvent censorship. All of these are reasonable applications. You also have a fair number of journalists: whenever they want to communicate with their sources or access information, they don't want to reveal that. So that is another type of use of Tor. You also have activists and whistleblowers.
They don't want to reveal information about themselves, who they're talking to, or what information is being shared. You also have law enforcement and military use. They don't want to reveal information about what they're doing, and sometimes they also want to hide among a larger group of people who are much more normal. And obviously you also have criminals who use Tor for their activities.
You don't see the slides, but the next one is about hidden services. A hidden service is basically this capability of running a website or server while hiding its physical location. It has other side effects that are quite interesting. One of them is that by running your website as a hidden service, you get self-authentication. You don't need certificates, because the onion address itself is derived from the public key and allows you to self-authenticate. It also gives you end-to-end encryption.
There are many systems besides websites that use Tor hidden services. SecureDrop, for example, used by The New Yorker or The Guardian, allows one to communicate with journalists and provide information to them. Even mainstream systems like Facebook have a hidden service. You also have applications like Ricochet, which allows secure messaging: every client runs as a hidden service, so you can hide the identity of the clients and who is talking to whom, and you also get rid of central entities. You also have other types of users: the Silk Road, for example, ran as a hidden service. Ransomware like CryptoLocker used hidden services to hide the location of the server that collects the bitcoins that people have to pay. So there's a variety of people who use it.
So maybe to also clarify, for this talk: Tor claims to have a set of properties, but it does not claim everything. For example, if you use the Tor Browser, you don't have any guarantee of end-to-end encryption, because if you browse a website that does not have HTTPS, any traffic going from the exit node is in the clear. For hidden services, when you create a hidden service, there's no guarantee that its existence is protected. What Tor aims at, at least in the system they have today, is that you can't tell the location of the service, not that you can't tell it exists. This work was about finding out how many relays misbehave by trying to get this information, as an indicator of other malicious activities.
This falls within the space of, in general, looking at privacy infrastructure and attacks on it. Whenever you have a privacy infrastructure, various kinds of entities try to attack it, maybe to reduce its popularity or to misuse it. Cryptography in general is used for good things, and a lot of people also try to misuse it. There has been related work trying to find out how many of the exit relays snoop on the traffic of users, which is different from what we are doing. Other work looked into the hidden services themselves and what kind of content they have. Here, what we want to know is this: out of the Tor relays, take the subset that has the HSDir flag and therefore hosts descriptors for the hidden services. How many of them are misbehaving, meaning that they log information that they're not supposed to log? Whoever is running them modified the code to be able to log this information, and later on they might visit these websites. As I mentioned, this is a problem that is known to the Tor people, and they've been working on resolving it.
They also have other techniques that they use to identify these misbehaving relays. Our technique has the advantage that it can cover misbehaving relays at a larger scale. So this is not really about breaking the privacy of Tor in terms of where you browse, but more about the existence of hidden services.
There are four questions we try to address. The first one is how many of the Tor relays are misbehaving in the sense that I defined? Oh. Yeah. This makes my life easier. So, there are four questions we're trying to address. One of them is how many of them are misbehaving. If we can get a lower bound on that number, we have an idea of how much misbehavior is happening in Tor. The next one is which of them are snooping, in terms of trying to find out information that they're not supposed to collect. The third one is what they really do: are they just collecting information, do they try to attack, are they aggressive or not? And the last one is who they really are, beyond which relay and which IP address. We have addressed mostly the first two questions and a little bit of the third one. The last one, who they really are, we didn't solve. This might be a nice community for looking into that and pushing this work to the next stage.
So first I'll explain a little bit how hidden services work. This diagram summarizes it. To run a hidden service, you pick a random public key. Some people will go and select one that ends up in an onion address they prefer; Facebook has a nice one. But typically you pick a random public/private key pair and hash the public key in a specific way, which gives you the .onion address. Then you pick a subset of the relays, a few of them, called introduction points, and you set up a Tor circuit to each of them. These introduction points will help people get back to you later on.
Then you hash your .onion address with time information and other things, and that gives you a descriptor ID. It also tells you which relays with the HSDir flag you should place this descriptor information with. You compute two of these descriptor IDs, and you end up with a set of six relays with the HSDir flag where you put the information.
On the other hand, at the same time, you give your .onion address to whoever you want to communicate with you. Then, in step three, this client takes the .onion address and hashes it the same way you did. Every day this gives him the descriptor IDs that tell him which HSDir relays he should go to. So in step three, he goes to these relays and asks them for the introduction points, to be able to talk to this hidden service. In step four, the client also selects something called the rendezvous point, some other relay, and sets up a circuit to it ... relays with the HSDir flag, which of them are misbehaving in the sense that I defined earlier.
So now, going back, a little more specifically about the HSDirs. These relays have identifiers, and those show up in something called the ring of HSDir identifiers. Your onion address, once you hash it, gives you this descriptor ID, and you find the first HSDir after your descriptor ID. You pick the first three, I mean the first, second and third. Then you hash it again in a different way, which gives you another descriptor ID, and you find the three after that one. You now have three plus three relays that host the information on how you can be reached. The reason it changes every day, and the reason you take more than one, is to have reliability and to protect against denial of service.
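The ring lookup just described can be sketched in a few lines of Python. This is a simplified illustration, not Tor's actual byte layout: the function names, the use of SHA-1 as the hash, and the way the time period and replica number are packed are all assumptions made for the example.

```python
import bisect
import hashlib
import struct


def descriptor_id(permanent_id: bytes, time_period: int, replica: int) -> bytes:
    """Hash the service identifier with a time period and a replica index.

    Assumption: a simplified stand-in for the real descriptor ID computation.
    """
    secret = hashlib.sha1(struct.pack(">I", time_period) + bytes([replica])).digest()
    return hashlib.sha1(permanent_id + secret).digest()


def responsible_hsdirs(desc_id: bytes, ring: list, n: int = 3) -> list:
    """Return the first n relay identifiers that follow desc_id on the ring.

    `ring` must be a sorted list of relay identifiers (bytes). The lookup
    wraps around to the start of the list when it reaches the end.
    """
    start = bisect.bisect_right(ring, desc_id)
    return [ring[(start + i) % len(ring)] for i in range(min(n, len(ring)))]


def placement(permanent_id: bytes, time_period: int, ring: list) -> list:
    """Two descriptor IDs (replicas 0 and 1), three relays each: six in total."""
    chosen = []
    for replica in (0, 1):
        desc_id = descriptor_id(permanent_id, time_period, replica)
        chosen.extend(responsible_hsdirs(desc_id, ring))
    return chosen


if __name__ == "__main__":
    # Hypothetical ring of 20 relays with made-up identifiers.
    ring = sorted(hashlib.sha1(f"relay{i}".encode()).digest() for i in range(20))
    service = hashlib.sha1(b"example public key").digest()[:10]
    for relay in placement(service, time_period=17000, ring=ring):
        print(relay.hex())
```

Because the time period is part of the hash, the same service lands on a different part of the ring when the period changes, which is why the set of relays that learn about an onion address keeps rotating.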
If you always hosted the information at the same location, someone who wants to block you could take that information and not serve it to anyone. The side effect is that whoever hosts that information can log it and can go visit your service, whose existence you don't want leaked.
So, our system, and how to detect who's misbehaving. The idea is quite simple. We create a large number of these things we call honey onions, like a honeypot. We set them up in a secure way, in the sense that we follow all the instructions so that they don't leak, and we don't tell anyone about them. So we know that if someone comes to visit our service, then one of the relays that had the information logged it and leaked it, or maybe gave it to someone else. That is the fundamental idea. But it is not that trivial, because every day you give the information to six relays, so it could be any of the six, and we have to find out which of the six.
Since we want to look at the global scale, we generate several batches: some number every day, another batch every week, and another batch every month. The reason for this is that some of them will collect these onion addresses but won't visit them immediately. They wait for a few days to make us think that it might be someone else. We computed that we need 1,500 onions to cover more than 95% of these relays. So every day we generate 1,500, every week 1,500, and every month 1,500, and then we see who's visiting. Some of you remember there was a peak in the number of Tor hidden services. I don't have time to comment on that, but it was not us, for various reasons. That number was much larger than what we generated: at any moment we had 4,500 onions in the system.
So before I bore you a little bit with some math, I'll just show you the reason for the name. We chose this name, honey onion, because it made sense.
Then we googled a little bit and found that, oh, it has a meaning. It was in fact quite interesting that the meaning really matched what we were doing, so we couldn't resist using it.
So how do you find out who's misbehaving? The idea, again, is quite simple. We create the onion and we put it in six places. If there is a visit, we know that one of the six is bad. But then, as we create more onions, they end up in different locations. Now, for these two onions, there are two relays that could explain the visits. Later on we place more, and you can tell that these are maybe not the guys, depending on the assumptions and so on. So you can see that there is a possible way to find out who is misbehaving.
On the architecture of the system that we built: in step one, you generate all these onions that you're going to place on these servers. Then some of them get visited. Whenever there's a visit, it tells us that everyone who knew about that onion should become suspicious. We put them in a graph that we build. It's called a bipartite graph: it has nodes that correspond to onions and nodes that correspond to relays. Whenever there's a visit to an onion, we add edges from it to all the relays that had this information.
Since I'm running a little bit out of time, let me get to what exactly we did; I'm not going to talk about the details of the math. The first thing we wanted to know is the smallest set of suspicious relays that could explain the visits that we see. That gives us a lower bound on how many of these relays are misbehaving: if you find the smallest set that explains the visits, you know that at least that many are misbehaving. There's a way to formulate it, but I'm going to skip that math. This is not necessarily a trivial problem. There are some heuristics that give an approximation.
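As a rough illustration of the bipartite graph and of the kind of heuristic mentioned here, the sketch below uses a greedy set-cover rule: repeatedly pick the relay that explains the most still-unexplained visits. The function names and the toy data are assumptions made for the example, and the greedy rule is a generic heuristic, not necessarily the one used in this work.

```python
from collections import defaultdict


def build_graph(visited_onions, placements):
    """Bipartite graph as a dict: visited onion -> set of relays that hosted it.

    `placements` maps every onion to the relays that held its descriptors.
    """
    return {onion: set(placements[onion]) for onion in visited_onions}


def greedy_explaining_set(graph):
    """Greedily choose relays until every visited onion is explained."""
    unexplained = set(graph)
    chosen = []
    while unexplained:
        # Count, for each relay, how many unexplained visits it could explain.
        counts = defaultdict(int)
        for onion in unexplained:
            for relay in graph[onion]:
                counts[relay] += 1
        if not counts:
            raise ValueError("a visited onion has no hosting relay recorded")
        # Highest count wins; ties are broken by name to keep runs repeatable.
        best = min(counts, key=lambda relay: (-counts[relay], relay))
        chosen.append(best)
        unexplained = {o for o in unexplained if best not in graph[o]}
    return chosen


if __name__ == "__main__":
    # Hypothetical data: three visited onions and the relays that hosted them.
    placements = {"o1": {"A", "B"}, "o2": {"A", "C"}, "o3": {"D", "E"}}
    graph = build_graph(["o1", "o2", "o3"], placements)
    print(greedy_explaining_set(graph))
```

A greedy pick always explains every visit, but it can return more relays than the true minimum, so its size is not itself a lower bound on the number of misbehaving relays; that requires solving the problem exactly.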
We could formulate this as what is called an integer linear program. Basically, to each relay we assign a variable, either zero or one, with one meaning that it is malicious. We want to minimize the sum of these x_i, so that you find the smallest set, but we need to explain every visit. This can be solved with what is called an ILP solver.
Before we tell you what we found: why do we trust that this technique works reasonably well? We also ran some simulations, selecting some of the relays to be malicious and so on, and you can see that we get between 97% accuracy and 81%, where 81% is the case in which there is a significant number of malicious ones. So now I'm going to pass the talk to Amirali, who's going to tell you the results of the experiment.
Good thing the slides are back, because otherwise I wouldn't have much to talk about. Here, from left to right on the bottom, you can see our daily, weekly and monthly schedules and their visits. The reason for having three schedules is that if an adversary visits a honey onion immediately, we can catch them in the daily batch. But if they wait for a while, they won't show up in the daily; we can still spot them in the weekly and monthly. The other thing, as was mentioned by Guevara, is the rise in the number of onion services. At each point in time we only had 4,500, but the increase in the number of onion addresses was at least an order of magnitude more than what we had. So we are sure it's not us. We started our experiments on February 12th, and most of the results that we explain are based on the 72 days that we ran this experiment, although we have data beyond that and we discuss it later. We are sure that the visits are not a result of the rise in the number of onion addresses, because the increase happened on the 18th of February, and we can see visits happening already on the 12th and 13th of February. So they happened before the increase.
The other thing to mention in the daily graph is that there are not many visits during the peak. One of the reasons is that it takes 96 hours, or four days, to get the HSDir flag. So after people saw that there were a lot of new onion addresses, they probably set up new relays, and it took them four days to get the HSDir flag. After that, they started probing. You also see more visits on the weekly and monthly onions because those run for a longer time, so these adversaries, the malicious HSDirs, had more time to visit them.
This is an example of a typical connectivity graph that we had. The gray circles in the middle are the visited onions. The black ones above them are the HSDirs that are picked by the ILP and that explain the visits. All the other colorful nodes are the other HSDirs that have been hosting these onions. As you can see, the orange one at the top right is more trivial to pick, because there is only one HSDir connected to both of the two onions. But the power of the ILP really shows in cases like the purple one at the top left, when you have many HSDirs that have been hosting many of the visited onions, and you want to know the lower bound, or which are the most likely HSDirs. That's where our technique and the ILP really show their power: we can pick these four and identify them as the most likely suspicious HSDirs.
Apparently we are running out of time. So, the snooping behavior: we saw some of them visiting everything. And we are hosting Alibaba. They were visiting most of the onions. We identified them and talked to the Tor people. After a while, they became more advanced and delayed their visits; that's the bottom left graph. As for geographical location, we mostly see them in Europe and North America. That's because Tor is used more there, so it's also representative of the usage of Tor. You don't have it much in the Middle East and China because it's mostly blocked.
This location doesn't necessarily mean that these are the countries that are snooping. These are relays that are located in these countries, not the countries themselves. To give you more statistics, more than half of them are hosted on cloud infrastructure. About 25% of them also had the exit flag, which is much more than what you would normally have. Some of them are doing some attacks, and some of them are less aggressive.
So maybe just one final comment. Since we've done this work, whoever was snooping has in fact changed their behavior. Now you can see that most of them delay: they don't visit immediately, they wait for days and sometimes a week before they visit. So this is still an interesting problem. We could stop here. Sorry that we couldn't really talk about the last part in detail.