Okay. So, I'll start off by letting you know who I am. Generally, this is a different audience for me, but I'm hoping there are some things we can share here. I work for the Tor Project, in a team that currently consists of two people, monitoring the health of the Tor network and performing privacy-preserving measurement of it. Before Tor, I worked on active Internet measurement in an academic environment, but I've been volunteering with the Tor Project since 2015. If you want to contact me afterwards, this is my email address; if you want to follow me on the fediverse, that's my webfinger ID.

So, what is Tor? I guess most people have heard of Tor, though maybe they don't know so much about it. Tor is quite a few things. It's a community of people. We have a paid staff of approximately 47; the number keeps going up, but it was 47 last time I looked. We also have hundreds of volunteer developers who contribute code, relay operators who help to run the Tor network, academics, and lots of people involved in organizing the community locally. We're registered in the US as a nonprofit organization, and the two main things that come out of the Tor Project are the open-source software that runs the Tor network, and the network itself, which is open for anyone to use. Currently, there are an estimated average of two million users per day. This is an estimate, and I'll get to why we don't have exact numbers.

Most people, when they're using Tor, will use Tor Browser. This is a bundle of Firefox and a Tor client, set up in such a way that it's easy for users to use safely. When you're using Tor Browser, your traffic is proxied through three relays. With a VPN, by contrast, there is only one server in the middle, and that server can see both sides: it knows who you are and where you are going, so it can spy on you just as your ISP could before.

The first step in setting up a Tor connection is that the client needs to know where all of those relays are, so it downloads a list of relays from a directory server. We're going to call that directory server Dave, and our user Alice talks to Dave to get a list of the relays that are available. In the second step, the client forms a circuit through the relays, and then finally connects to the web server that Alice wants to talk to, in this case Bob. If Alice later decides they want to talk to Jane, they'll form a different path through these relays.

And we know a lot about these relays. Because the relays need to be public knowledge for people to be able to use them, we can count them quite well. Over time, we can see how many relays there are announcing themselves. We also have bridges, which are a separate topic, but these are special-purpose relays. Because we have to connect to the relays, we know their IP addresses, and we know whether they have IPv4 or IPv6 addresses, so as we work to get better IPv6 support into the Tor network, we can track this and see how our network is evolving. And because we have the IP addresses, we can combine them with GeoIP databases, which can tell us what country those relays are in with some degree of accuracy. Recently we've written up a blog post about monitoring the diversity of the Tor network: the Tor network is not very useful if all of the relays are in the same data center.
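To make the GeoIP step concrete, here is a minimal sketch of how relay counts and bandwidth shares per country could be derived from a relay list. This is illustrative only, not Tor metrics code: the relay records and the `lookup_country` table are made-up stand-ins for parsing the network consensus and querying a real GeoIP database.

```python
from collections import Counter

# Illustrative relay records: (IP address, advertised bandwidth in bytes/s).
# A real pipeline would parse these out of the network consensus.
relays = [
    ("203.0.113.7", 12_500_000),
    ("198.51.100.23", 800_000),
    ("192.0.2.55", 5_000_000),
]

def lookup_country(ip: str) -> str:
    """Stand-in for a real GeoIP database lookup."""
    table = {"203.0.113.7": "SE", "198.51.100.23": "RU", "192.0.2.55": "RU"}
    return table.get(ip, "??")

relay_count, bandwidth = Counter(), Counter()
for ip, bw in relays:
    country = lookup_country(ip)
    relay_count[country] += 1
    bandwidth[country] += bw

total_bw = sum(bandwidth.values())
for country, n in relay_count.most_common():
    print(f"{country}: {n} relays, "
          f"{100 * bandwidth[country] / total_bw:.1f}% of bandwidth")
```

With these invented numbers, RU has the most relays but SE contributes most of the bandwidth, which is exactly the distinction between counting relays and weighting by capacity that comes up next.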
We also perform active measurement of these relays, and we analyze them closely, because this is where a lot of the trust in the Tor network is placed. That trust is distributed between multiple relays, but if all of the relays are malicious, the Tor network is not very useful. So we make sure that we're monitoring this diversity. The relays come in different sizes, so we want to know: are the big relays spread out, or is it just a lot of little relays inflating the count of individual relays? When we look at these two graphs, we can see that the number of relays in Russia is just over 250 at the moment, but when we look at the top five countries by the actual bandwidth they contribute to the network, Russia drops off and Sweden takes its place, contributing around 4% of the capacity.

The Tor metrics team, as I mentioned, is two people, and we care about measuring and analyzing things in the Tor network. There are three or four regular contributors, and occasionally people come along with patches or perform a one-off analysis of our data.

We use this data for lots of different use cases, one of which is detecting censorship. If websites are blocked in a country, people may turn to Tor in order to access those websites. In other cases, Tor itself might be censored, and then we see a drop in Tor users; at the same time, we would see a rise in users of bridges, the special-purpose relays I mentioned earlier that can be used to circumvent censorship. So we can interpret the data in that way. We can also detect attacks against the network: if we suddenly see a huge rise in the number of relays, then we can suspect that maybe something malicious is going on, and we can deal with that. We can evaluate how performance changes when we make changes to the software. We've recently made changes to an internal scheduler, where the idea is to reduce congestion at relays, and from our metrics we have a good indication that this is working. And probably one of the more important aspects is being able to take this data and make the case for a more private and secure Internet, not just from a position of "I think we should do this, I think it's the right thing," but with data, with facts we can argue with that can't easily be disputed.

We only handle public, non-sensitive data, and every analysis we do is reviewed before we publish it. As you might imagine, the goals of a privacy and anonymity network don't lend themselves to easy data gathering and extensive monitoring of the network. The Tor Research Safety Board can offer advice on how to do research on Tor, or collect data through Tor, safely. Often it's used by academics who want to study Tor, but the metrics team has also used it on occasion when we wanted a second opinion on deploying new measurements.

What we try to do is follow three key principles: data minimization, source aggregation, and transparency. The first of these is quite simple, and with the GDPR it's probably something people need to think about more even if you don't run an anonymity network. Having large amounts of data that you don't have an active use for is a liability and is something to be avoided. Given a data set and an infinite amount of time, that data set is going to get leaked; the probability only increases as you go along. So we want to make sure that we're collecting as little detail as possible in order to answer the questions that we have.
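Returning to the censorship-detection use case mentioned a moment ago: Tor metrics' real censorship detector is more sophisticated than this, but a toy version of the underlying idea, flagging days whose user count deviates sharply from a recent moving average, could look like the sketch below. The window, threshold, and user counts are all invented for illustration.

```python
def flag_anomalies(daily_users, window=7, threshold=0.5):
    """Flag days whose user count deviates by more than `threshold`
    (as a fraction) from the mean of the preceding `window` days."""
    flagged = []
    for i in range(window, len(daily_users)):
        baseline = sum(daily_users[i - window:i]) / window
        if abs(daily_users[i] - baseline) > threshold * baseline:
            flagged.append(i)
    return flagged

# A sudden drop (Tor itself being blocked) and a sudden spike (blocked
# websites driving users to Tor) both stand out against the baseline.
counts = [2.0e6] * 10 + [0.8e6] + [2.0e6] * 5 + [3.5e6]
print(flag_anomalies(counts))  # -> [10, 16]
```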
The second principle is source aggregation: when we collect data, we want to aggregate it as soon as we can, to make sure that sensitive data exists for as little time as possible. This usually means aggregating in the Tor relays themselves, before they even report information back to Tor metrics; they aggregate the data, and then we aggregate the aggregates. This can also include adding some noise or binning the values. All of these things help to protect the individual. And then we try to be as transparent as possible about our processes, so that our users are not surprised when they find out we're doing something, relay operators are not surprised, and academics have a chance to say, "whoa, that's not good, maybe you should think about this."

The example I'm going to talk about is counting unique users. Users of the Tor network would not expect us to be storing their IP addresses or anything like that; they've come to Tor because they want the anonymity properties. The easy way, traditional web analytics, is to keep a list of all of the IP addresses and count up the uniques, and then you have an idea of the unique users. Combining that with a GeoIP database, you can get unique users per country, and so on. We can't do this. So we measure indirectly, and in 2010 we produced a technical report on a number of different ways we could do this.

It comes back to Alice talking to Dave. Because every client needs to have a complete view of the entire Tor network, we know that each client will fetch the directory approximately 10 times a day. So by measuring how many directory fetches there are, we can get an idea of the number of concurrent users of the Tor network: roughly speaking, dividing the number of directory requests seen in a day by 10 gives an estimate of the number of clients. Relays don't store IP addresses at all; they count the number of directory requests, and those counts are reported to a central location. We don't know how long an average session is, so we can't say we had this many unique users, but we can say how many users we had concurrently on average. We get to see trends, but we don't get the exact number.

So here's what our graph looks like at the moment. We have the average of 2 million concurrent Tor users. This peak here, I think, was an attempted attack, and possibly this one as well. Often things happen and we don't have full context for them, but we can see when things are going wrong, and then we can also see when things are back to normal afterwards.

This is in a class of problems called the count-distinct problem. Those were our methods from 2010, but since then there has been other work in this space. One example is HyperLogLog. I'm not going to explain this in detail, but I'll give a high-level overview. Imagine you have a bit field, and you initialize all of the bits to zero. You take an IP address, you take a hash of the IP address, and you look for the position of the leftmost one: how many zeros were there at the start of that bit string? Say there were three; you would then set the third bit in your bit field. At the end, you have a series of ones and zeros, and you can get from this to an estimate of the total number of distinct items you've seen, because each bit is half as likely to be set as the one before it, given the number of distinct items observed. There's a very involved proof in the paper, which I don't have time to go through here, but it turns out to be a fairly accurate estimate for counting unique things. This was designed for very large data sets where you don't have enough RAM to keep everything in memory. It turns out that we have an inverse variant of this problem: for us, even keeping two IP addresses around in memory would be too large a data set. So we use this to avoid storing even small data sets.
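What the talk describes here is closest to the classic Flajolet-Martin bitmap sketch that HyperLogLog builds on. Here is a minimal, illustrative implementation of that simpler variant; the hash choice and the correction constant follow the original Flajolet-Martin analysis, and none of this is actual Tor code.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction factor

def leading_zeros(h: int, width: int = 32) -> int:
    """Count the zero bits before the leftmost 1 in a width-bit hash."""
    for i in range(width):
        if h & (1 << (width - 1 - i)):
            return i
    return width

def estimate_distinct(items) -> int:
    """Estimate the number of distinct items without storing any of them."""
    bitmap = 0
    for item in items:
        # Hash the item down to a 32-bit value.
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:4], "big")
        # Set the bit corresponding to the leading-zero count.
        bitmap |= 1 << leading_zeros(h)
    # r = position of the lowest unset bit; the expected count is ~ 2^r / PHI.
    r = 0
    while bitmap & (1 << r):
        r += 1
    return round(2 ** r / PHI)

# Repeats don't change the bitmap, so only distinct addresses count.
addresses = [f"192.0.2.{i}" for i in range(100)] * 5
print(estimate_distinct(addresses))  # rough, high-variance estimate near 100
```

A single bitmap like this has high variance; HyperLogLog gets its accuracy by keeping many such registers and averaging across them.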
Private set intersection cardinality is another example. With this one, you can look at distributed databases and find unique counts across them. Unfortunately, it currently requires far too much RAM to actually do the computation, so we can't use it yet, but these methods are evolving over time and hopefully will become feasible soon.

Then, moving on from count-distinct, there's the aggregation of counters. We have counters such as how much bandwidth has been used in the Tor network, and we want to aggregate these without releasing the individual relay counts. So we are looking at using a method called PrivCount that allows us to get the aggregate total bandwidth used while keeping each individual relay's bandwidth count secret. There are similar schemes to this: RAPPOR and PROCHLO from Google, and Prio, which Mozilla has written a blog post about. All of the links here are in the slides, which are also on the talk page on the FOSDEM schedule, so don't worry about writing these down.

And then finally, I am working on putting together some guidelines for performing safe measurement on the Internet. These are targeted primarily at academics, but if you wanted to apply this to analytics platforms, or to monitoring of anything that has users whose privacy you want to respect, then there could be some techniques in here that are applicable.
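To illustrate the kind of mechanism PrivCount relies on, here is a minimal sketch of additive secret sharing, the core building block: each relay splits its counter into random shares that sum to the true value modulo a prime, so no single tally server ever sees an individual relay's count, yet the grand total can still be recovered. This is a simplified illustration of the general technique with made-up numbers, not PrivCount's actual protocol, which adds noise and blinding on top.

```python
import secrets

P = 2**61 - 1  # prime modulus; all arithmetic is done mod P

def split_into_shares(value: int, n_servers: int) -> list[int]:
    """Split a counter into n additive shares that sum to value mod P."""
    shares = [secrets.randbelow(P) for _ in range(n_servers - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

# Three relays with secret bandwidth counters (invented numbers).
relay_counters = [1_200, 5_400, 830]
n_servers = 3

# Each tally server receives one share from every relay; any single
# share is uniformly random and reveals nothing about the counter.
server_inputs = [[] for _ in range(n_servers)]
for counter in relay_counters:
    for server, share in zip(server_inputs, split_into_shares(counter, n_servers)):
        server.append(share)

# Each server publishes only the sum of the shares it holds; adding
# those partial sums recovers the aggregate, and nothing else.
partial_sums = [sum(shares) % P for shares in server_inputs]
print(sum(partial_sums) % P)  # 7430, the total of all relay counters
```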
Okay, so that's all I have. Are there any questions?

"I have a question about how many users, or relays, have to be honest so that the network stays secure and private."

Okay, so at the moment, when we're collecting statistics, we know from the active measurement I showed how much bandwidth a relay can cope with, and we do some load balancing, so we have an idea of what fraction of traffic should go to each relay. If one relay is expected to see a certain level of traffic and reports wildly different statistics from another relay, then we can say, okay, this one is cheating. There's not really any incentive to do this other than to mess up our data, and it's something we can detect quite easily, but we are also working on more robust metrics going forward to avoid this being a point of attack.

"Hi, thanks for the presentation. A few days ago I heard that, with your metrics, you know that you have between 2 million and 8 million users, and you don't really know what the real number is. Can you say a bit more about the variance, or which method is more accurate?"

Okay, so the 8 million number comes from the PrivCount paper, where they did a small study looking at unique IP addresses over a day. We look at concurrent users, so they're two different measurements. What we can say is that we know for certain there are between 2 million and 25 million unique users per day. We're not sure where in that range we fall, and 8 million is a reasonable-ish number, but they measured IP addresses, and some countries use a lot of NAT, so it could also be more than 8 million. It's tricky, but we see the trends. Any other questions? There's one down here.

"Your presentation implies that you are able to collect more private data than you currently do. It suggests that the only thing preventing private user data from being collected is the team's good will and good intentions. Have I got that wrong?"

No, okay. So the question is whether there is any possibility for the Tor Project team to collect private user data. The Tor Project does not run the relays. We write the code, but individual relay operators run the relays, and if we were to release code that suddenly started collecting lots of private data, people would notice, and they wouldn't run that code. So there is a step between development and deployment. It's possible that other people could write that code and run those relays, but if they started to run enough relays to look suspicious, then people would ask questions. So there's a distributed trust model with the relays. I'll also mention that at 3 o'clock there's a relay operators meetup, if you're interested in running a Tor relay. I don't remember the exact room, but it's on the Tor Project blog. Any other questions? One more? Yeah. Okay, last question.

"You talk about privacy-preserving monitoring, but a couple of years ago we learned that the NSA, with the XKeyscore program, was able to monitor relays and learn exactly which users were connecting to Tor relays. So is there also research on making sure that I cannot be identified as using a Tor relay, and can never be monitored?"

Yes, there is. There's lots of research in this area. One approach is obfuscation techniques, where you make your traffic look like something else. These are called pluggable transports, and they can make your traffic look like all sorts of things.

All right, thank you. I'll be outside here if anyone wants to come and ask any more questions.