All right, why don't we get started? My name is Will, and this talk looks at an intersection between the academic world, and how ethics in academia has approached this issue of measuring censorship, how over the internet we can try to understand adversarial government censorship, and conversely, the grassroots or organic hacker tools that are doing the same thing. This is actually, maybe surprisingly, an area where the open source and hacker world has been leading on ethics, which is fascinating. I was doing this more from the academic side: I was at the University of Washington, where I did most of this work, and then spent last year as a postdoc at the University of Michigan. To place that in terms of what this looks like on the academic side, the main venues where these academic papers come out: ACM has a conference called the Internet Measurement Conference, USENIX Security has a bunch of this, and SIGCOMM is another major conference. Then there are smaller, more specialized working groups: FOCI is the Workshop on Free and Open Communications on the Internet, Privacy Enhancing Technologies is also considered at that top tier, and there's Passive and Active Measurement. All of these happen every year, basically, and these are the places you would expect publications of research looking at how censorship exists on the internet, how it gets measured, and in some cases trying to characterize and circumvent it. Cool. So how do we measure censorship? This is a diagram from OONI, the Open Observatory of Network Interference, which is a project that falls under the Tor organization's umbrella. The basic design of OONI, and of a lot of systems, is that people build some software.
The software runs on people's computers or phones in whatever country or network they're in, and it tries to request content. When it requests it directly, it checks whether it actually gets the thing it's expecting, or whether some adversarial presence in the network, like a government, a school filter, or a firewall, has messed with that connection. And then it has a reference. OONI uses Tor to get the reference: acting as some random anonymous user, what is the actual content that I should expect? Other tools will save a canonical reference of what they expect the content to be. They're looking for divergence, and when they see that they get back something that looks more like a block page rather than what they expect the content to be, or if they've tried to connect to WhatsApp and that connection failed because of a reset at the TCP level, that's when they're going to say, look, this connection is disrupted in your network. So there's generally this layer of a software component running on end user devices, combined with some sort of aggregation where reports all end up getting centralized, because it's that centralization of reports that helps people understand overall what is blocked where. A single location isn't as useful, because you can't really tell whether that's a country-level policy or whether that guy just happened to be at a school, or his company was blocking stuff. It's the aggregate of many users that ends up providing this value. Cool. So the first generation of measuring censorship and trying to understand these policies looked like this.
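To make the divergence check concrete, here is a minimal Python sketch of the kind of classification logic described above. This is an illustrative assumption, not OONI's actual implementation: the function, verdict names, and block-page markers are all hypothetical.

```python
# Illustrative block-page fingerprints; real tools use curated, per-country lists.
BLOCKPAGE_MARKERS = ["access denied", "this site has been blocked"]

def classify(direct_body, direct_error, reference_body):
    """Coarse verdict for one measurement.

    direct_body    -- page text fetched over the local network (str), or None
    direct_error   -- transport error label (e.g. "tcp_reset"), or None
    reference_body -- the same resource fetched via a trusted reference channel
    """
    if direct_error == "tcp_reset":
        # The connection itself was torn down, e.g. by an injected RST.
        return "connection_reset"
    if direct_body is None:
        return "failure"
    if direct_body == reference_body:
        return "ok"
    lowered = direct_body.lower()
    if any(marker in lowered for marker in BLOCKPAGE_MARKERS):
        return "blockpage"
    # Divergence with no recognizable block page: could be censorship, could be
    # a CDN serving localized content -- this is why aggregation matters.
    return "inconsistent"
```

A single "inconsistent" result is deliberately weak evidence; it only becomes meaningful once many vantage points report it.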
On the academic side there was a project that came out of Harvard and a bunch of other groups called the Open Net Initiative; the other notable player there is the Citizen Lab in Toronto, which took it a step further. This group of academics went and found civil society and activist groups and organizations in a bunch of countries around the world, and then worked with them. They would find these groups, develop a relationship, and whenever there was political unrest or things of that nature, go to them and say, can you try testing this from these computers? It's often a fairly manual process: you've developed that relationship, not in an automated electronic way, and the content that you're testing is quite curated rather than automatic. And so there's a whole set of platforms that ended up getting used; people try to look for, I don't know if "legitimate botnets" is the right word, but what are the botnets we can use to get vantage points, where we can start to do these measurements more automatically. So you end up with things like PlanetLab, which is a set of university computers: a bunch of universities all run machines, and everyone else who's participating can also run experiments on them.
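Whatever the vantage points are, the aggregation step the talk keeps coming back to, turning many individual reports into "what is blocked where", can be sketched roughly like this. The function name, thresholds, and report shape are hypothetical, just to show the idea of filtering out single-vantage-point noise:

```python
from collections import defaultdict

def aggregate(reports, min_measurements=3, threshold=0.8):
    """reports: iterable of (country, domain, blocked) tuples.

    Returns the set of (country, domain) pairs where a large fraction of
    measurements saw blocking -- ignoring pairs with too few measurements,
    since one user behind a school or corporate filter proves nothing about
    country-level policy."""
    tally = defaultdict(lambda: [0, 0])  # (country, domain) -> [blocked, total]
    for country, domain, blocked in reports:
        entry = tally[(country, domain)]
        entry[0] += int(blocked)
        entry[1] += 1
    return {
        key
        for key, (blocked, total) in tally.items()
        if total >= min_measurements and blocked / total >= threshold
    }
```

The point of the `min_measurements` floor is exactly the caveat above: a single report can't distinguish national policy from one office firewall.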
RIPE Atlas is run by RIPE, the European regional internet registry (under the ICANN umbrella). They have a deployment of probes for testing network connectivity in, I think, tens of thousands of locations at this point. And there's a whole bunch of others that got used, but it's often all in this model where the researcher somehow controls these end vantage points and runs measurements on them. So you can think about what the ethical risk is here. This is just one of many things, but what happens if these end users are discovered to be cooperating with researchers or with projects, and to have software on their phones or end devices that is measuring censorship? There are cases in China of measurement of what other apps are running on your phone, and monitoring of your devices. There have been reports in a couple of places in the Middle East that at border checkpoints, or at checkpoints within countries, your phone gets looked at: your Facebook profile gets looked at, and the apps you have installed get looked at. So what happens if someone decides they don't like that you've got a network measurement app on your phone or computer? There is some risk, hard to quantify, that you're going to piss someone off and that the users running this stuff get in trouble. And so this is leading up to the ethical dilemma. This broke on the academic side with a paper in 2015 at SIGCOMM called Encore, which came out of Princeton and Georgia Tech. What Encore basically did is it said: we can measure censorship by embedding a little web tracking script, essentially. We're going to do something very similar to how web advertising works: we're going to find participating web publishers, so you'll go to our web page, or just arbitrary web pages; you the user don't need to know that you've actually gone to this web page that is running this JavaScript, and it will try to load the favicons, the little 32 by 32 pixel images that show up in the tab bar, from a bunch of other sites. You can think of this as the third-party ads, but they were using it for sites like Twitter, so it would try to load the Twitter favicon. Now, the web security model means you can't read the content from another domain: if I'm on the Georgia Tech site, my local JavaScript running on some end user's device can't read the image from Twitter; that's a cross-origin security issue. But it can tell whether that image loaded successfully or unsuccessfully. So they aggregate all of these load attempts and see if there are locations where the favicon for a specific domain never seems to load, and show that that can be an indication that the site is blocked in that location. Cool, so we've gone a step meta. This is super automated now: we don't have to get end users to run anything, but they also don't know they're running anything. And so this spurs this whole ethical dilemma. This paper got published, and you can see this top bar. That was a statement from the program committee, the SIGCOMM reviewers, and it says: we appreciated the technical contributions made in this paper, but it was controversial because it raises ethical concerns, and the controversy arose because the networking research community does not yet have guidelines or rules for the ethics of experiments that measure online censorship. The position was: in accordance with our guidelines, if the authors had not engaged with their institutional review board (IRB), or their IRB had determined that this was unethical, we wouldn't publish this. But they did engage with it, it didn't flag anything, and so we're going to publish it, but we're going to put this big disclaimer here
saying we're not sure if this is actually ethically sound. This ends up being the worst of both worlds on the academic side, because now there's this whole set of papers from people wanting to explore this, all of whom are super freaked out: what do we need to do so that the set of senior academics in the field doesn't declare this ethically questionable? "The PC does not endorse the use of the experimental techniques in this paper or the experiments that the authors conducted": the PC has somehow decided they are really not happy with this. Okay, so let's talk about what IRB is and what this paper actually did. IRB came out of things like the Stanford prison experiment and the Milgram obedience experiments. Those were experiments involving deception, where they didn't tell the participants what they were getting into while measuring a bunch of psychological things, and people had a bunch of problems with that. It fell very clearly on the unethical side of the line, where if it had gone through review beforehand it probably wouldn't have been approved. And so we've got this thing in place at all US universities, and it's pretty international at this point, called the institutional review board: any time in academia you're doing an experiment that involves human subjects, so you collect personally identifiable information, you deal with humans, you have to go through a process of approval from the university before you can do it. The motivation for the researcher is that if there are legal claims, by having gone through this the university has now essentially said they'll handle that for you. You aren't as individually liable, because through this pre-vetting process you told the university what you were going to do, and a board of people who aren't directly in your area looked at it and said, yeah, this seems ethical and appropriate. That means the university is covering you a lot more as an individual researcher. But the trick is that at most universities the place this gets used is in medical and psychological research, so it's really thinking about people as human subjects, and when we get to the network we start to have a lot of problems. In this case, with Encore, they said: we went to the IRBs at both Georgia Tech and Princeton, and they declined to formally review our paper because it does not collect or analyze personally identifiable information and is not human subjects research. This has been a very common response that people going to IRBs about censorship and network measurement are getting: you're just talking to computers and IP addresses, and since an IP address isn't a human, you're not collecting human names or medical records; this is exempt, this is research we are not in a position to comment on. So the general response is the IRBs not saying "yes, we're willing to sign off on this," but instead saying "this isn't our jurisdiction." And now the PCs, the people reviewing the papers once they're done, are saying: well, that ethical burden is stuck on us, because no one beforehand has actually been able to give this a stamp of approval. So the academic side has gone very slowly; it's been somewhere between five and ten years at this point of people trying to be very cautious about understanding what, in academic publications, they feel safe talking about, how much data you can collect, or what sorts of experiments you can do. The guiding principle right now is something called the Menlo Report, which, weirdly, came out of DHS. It's a set of ethical principles for network measurement and networked research, and it has this whole set of things
about how you should try to get informed consent, and if you can't get informed consent, you need to follow a principle of least harm, where you look at the minimal amount of data or interaction you need in order to get the value from your research. Because of how wide a set of research this covers, it often ends up being unsatisfying, but this is the set of guidelines a lot of researchers are trying to follow when they talk about this. So the trick is, you can look at things like Encore, and I'll go through how this has evolved in the last few years. There's a sense in which this is way better: the user doesn't know, the user doesn't have any long-lasting data on their computer, there are fewer incriminating things. If someone is going after a user, there's much less participation, much less ability to really argue the user wanted to be seditious. So there's this inclination that by doing remote measurements like Encore did, you're putting users at much less risk. But this ends up having a lot of unknown unknowns: you don't know how a government is going to mess with this, or whether any government is going to mess with it at all. And so people have been really scared about saying which of these is better, which of these we feel comfortable doing. Cool, so this leads me to the next place where ethical guidelines have come in, and that's scanning the internet. This was 2013, 2014, somewhere in the early 2010s: ZMap and Masscan both came out, two rapid internet port scanners. The commercial variant here is something like Shodan; these academic tools are what's doing the scanning underneath those. In ZMap they talk about their recommended practices and principles for how to be a good scanner of the internet: if I'm sending packets out to every IP address on the internet, what should I be doing to be cool about it and not get people in trouble? A lot of these guidelines are very much on the technical side: you should not DDoS people accidentally; you should have an abuse report address where people can complain and say please stop doing that, and then you stop doing it. It's more about how to be a good internet citizen than about whether you're accidentally going to get people in trouble. So ZMap gets used now to find a bunch of censorship; DNS was the easiest, and that's where a couple of papers came out in the following years. What they say is: there are 12 million open DNS servers around the world, DNS servers that any IP address can just ask for any resolution, and they'll give back what they think the IP address of that website is. Let's use ZMap to find all of those, then ask all of them for these potentially sensitive sites, and see which servers in which locations don't give back the response we expect. So if I find all of the open DNS servers in China and I ask for a Falun Gong website, I get back a random IP address, not the real one, because China is using DNS censorship: they're manipulating those DNS records to block that content. So there's a specific technical slice of how censorship is implemented that you can get at through this. And now there's this question: can we do this? Is it ethical? Often DNS servers that are just open and answering queries for everyone are misconfigured. We don't know if they're run by end users who might get in trouble for participating in this research. And the fact that that DNS server has now been trying to resolve these potentially sensitive domains: does someone who's looking at network traffic say, well, who was looking at this illegal website, see this DNS server, and go and
grab that user who owned that IP address? There are places in the world where your IP address is tied to your identity. Yeah, so the question was about the case in Russia two years ago, where a Tor exit operator, who again is running more generic infrastructure, got put in jail for illegal activity that was conducted through his computer. Was this Dmitry Bogatov? Yes, Dmitry Bogatov. So Tor is in a middle place, because that infrastructure is seen as being like running an app: you're somehow involved in a technology that is focused on these sorts of activities. DNS is far on the other end: it's just very general-purpose internet infrastructure, and you're not going to get in trouble just for running DNS; that's how the internet functions. But if someone misuses your DNS server to do sensitive stuff, could you get in trouble for that? That's the question. So in 2016, Satellite comes out. It uses 250,000 servers, but limits its ethical exposure by only doing measurements on the most popular Alexa domains. It doesn't try domains that are obscure and potentially really sensitive; it looks at globally popular domains and looks for disruption there. The theory is that if I look for Google everywhere, Google may actually be blocked in places, so we'll find some instances of high-profile domains being blocked, but it's unlikely someone's going to get in trouble for requesting Google, just because it's so big, and there are so many people with devices that are always trying to hit Google and failing in whatever country. In 2017, a year later, there's a similar paper called Iris that comes out at USENIX Security. They use six to eight thousand DNS servers that they believe are more vetted. They look for these open DNS servers and filter the potentially 12 million of them down to about 8,000 that have a reverse record where the host name is ns1.something, as a way of saying: these are servers someone understood well enough to actually set up as DNS servers, not just random machines on the internet, so we can feel more comfortable requesting really whatever sites we want through this subset. But they get far fewer vantage points: they have, I think, roughly a hundred countries they can see into, whereas if you use all of them you can see into a hundred and eighty or a hundred and ninety countries. So there's a trade-off that people are still trying to find. And we can look at this and say, okay, people are spending a lot of time thinking about these ethical questions of what we can query, and then you look on the other side, at the tools that all of the hackers are using, like Shodan, and they're like, yeah, we're just port scanning everything. When you look at what these DNS queries actually are, they're lost in the noise. If you're actually measuring the internet, the one query per minute that any of these censorship scanners is doing is so minimal compared to the background internet noise floor that it's unlikely this stuff is even triggering alerts. And that's this other part: that's not the way the academic review boards are thinking about this at all. They're worried about the project getting seen, some government getting unhappy about it, going after someone, and claiming it's in response to this; it's not necessarily a reflection of the actual traffic that's happening. So the core ethical question, I think, on the academic side, is what this additional higher standard is that they need to be setting. At Michigan or Washington or wherever, I've got a lot of leeway: I can be running ZMap and ignoring all of the abuse complaints I'm getting, because
I'm doing it for a legitimate research purpose. But the trade-off is: how do I make sure I'm a good internet citizen, how do I make sure I'm not actually putting people at risk? And that involves quantifying how much risk there is in running some sort of remote measurement of potentially sensitive content against end users who don't know about it and aren't consenting. That then gets weighed, because the Menlo Report framework says you have to weigh that risk against how much value you're getting. But the value is also very unproven. You've got successful projects like OONI that are getting used a lot by policy analysts down the road, but it's very difficult to quantify: great, we want to do this to get a paper out, and we think there's maybe future value, but how do we weigh that against this unknown potential risk? So this debate is continuing. The beginning of this year, in Turkey, and this actually started two years ago and is still rolling out: after the failed coup, Turkey ended up arresting about a hundred thousand people for, they said, using an encrypted messenger associated with an opposition party, called ByLock. They knew there were people in the opposition party using this thing called ByLock, so they went to the national ISP and said: give us the IPs and names, essentially, of everyone who's connected to the ByLock servers. The ISP did some sort of query over their network logs and got a hundred thousand people, because there is a tie between IP addresses and customer records. In the initial phase they didn't round everyone up and bring them to jail, but the list went public and people lost reputations. Over the next two years they started filtering that list down, because it turns out the ISP did no due diligence; it just ran that direct query. So then it was: oh well, the IP address in the DHCP lease wasn't actually assigned to the same customer at the time, so we just got a bunch of random customers who weren't actually using this app, but whose IP address at some other point in the DHCP churn had contacted this server. The most recent development was that it turned out ByLock, the instant messenger, carried advertisements, so they'd also swept up all the people who had just been visiting websites and been shown advertisements served from the same server, because those were also connections from your computer to the ByLock "bad" IP address. The query of which computers in the country had connected to this bad server also included people who were merely shown advertisements. At this point they've whittled the list that they released, and that was dealing damage, down to about 10,000 people from a hundred thousand, so they've reduced it by a factor of ten. And there's a bunch of people now, I think from before the advertisements came to light, who are trying to get forensic analysis done on their phones to show that it was just advertisements and not that they had the app installed. So there's this whole mess: the government doesn't actually care, and isn't necessarily going to do due diligence; it will make a point and potentially round up whoever it wants to round up. Does this matter? How much does it matter? You can be however careful you want; is it enough? So here's a current paper that's going to show up at USENIX next week, I think. This is, again, scanning: an application-level DPI scanning technique doing something very similar. They find echo servers on port 7. They send an HTTP request over port 7, and the echo protocol should send back exactly the same bytes. If they don't get back the same response, or if the connection goes down, they say there's probably something doing DPI that didn't like that domain or that web request we made and had played back. So they've got something very similar that they're
saying in terms of their ethical disclosure. This is their paragraph on ethics; actually they've got about a page and a half on ethics, which is maybe the biggest indicator that the academic community has not felt safe about this yet. But the relevant paragraph is: first, the risk to the users is low, and if we were to contact them to ask consent, that interaction would be worse for them than not contacting them and just running this. Second, if we only used servers whose operators granted consent, the operators would face much higher risk of reprisal, since their participation would be easy to observe and would imply knowing complicity. And third, there's the difficulty of identifying and contacting, you know, half the internet. There are a couple of problems here. So again, there's this trade-off between your benefits and your risks; there's not necessarily an easy way to square this with the Menlo Report's "try to get informed consent" side. So how do we remedy this? A lot of what's happened so far is blending into the background noise: if you can trickle out queries, keep them from being sensitive, and see similar queries already going on from the noise floor, you feel somehow safer. One of the things happening more now is people asking: can we avoid reaching end hosts at all? Can we set the TTLs or other properties of the connection so we don't actually reach the end machines? Because if we don't reach the end machines, maybe the full connection never happens, and when some ISP looks at its logs, it won't see a successful connection between two "bad" machines. When you talk to ISP operators about what such a query actually looks like to them, what matters is whether there was a connection. So if you can probe without fully establishing the connection, or by faking the connection, you can maybe learn things without that record being present. That seems to lower the risk, but it's way harder technically, so there's a trade-off there. And for a lot of these techniques, like the Encore web technique, you couldn't do it that way, because you don't control that level of the TCP stack through JavaScript. The worrying variant of this is the subprime repackaging we're seeing. While people have been very cautious about scanning or measuring directly, what there has been a huge rise of recently on the academic side is making use of VPNs that have a lot of vantage points. The one I want to talk about is Hola, or Luminati. This is a peer-to-peer VPN that a lot of people have apparently installed. It's a free VPN: it gets around geo-restrictions, you can watch movies in Britain, you install this thing, it's free, cool. What you don't know is that they have this other half of their company called Luminati, their business-facing side, which says: you pay us something like $500 a month, and you can send traffic out through any of the people who installed our free VPN. That's how they actually make their money, and it's advertised to businesses that want to do things like competitor analysis: you want to see how normal users see your product, or your competitors' products, without revealing who you are? Great, we've got the product for you. And so there are a bunch of academic papers that have now done censorship measurement, or other things, through this, and they say: well, we went through this company's vetting process, they supplied the vantage points, they got informed consent from the users. Which, now, did they? That's a question; there have been a couple of hacker exposés suggesting these guys are just packaging this with malware and such to get their usage counts up. But it means the ethical dilemma has been pushed away from the research community. So this ends up being
the preference for how to do this faster, rather than actually figuring out where you're drawing your ethical line and then arguing about it. The only other thing I'd say is that the echo paper that's coming out apparently generated something like 50 pages of PC review comments before it came out. It doesn't have a disclaimer saying it's unethical, but it still generated this huge amount of debate and had a bunch of effort around it to be able to say: yeah, we believe we've covered our bases. So getting this sort of stuff published on the academic side is still iffy and a lot of work. Cool, I think that's basically what I wanted to say; happy to take questions. Yeah, so with the VPN: these volunteers are in other countries, and in those countries, when they access the censored website, is that considered illegal or treasonous, and if so, are you then aiding them in breaking the law in their own country? Potentially, right. On accessing content, there are different laws in a lot of different places. The worry is, yeah, there are definitely places where access to specific websites is illegal. The other answer is that if the user doesn't know anything about it, and they've got malware installed on their computer that's accessing something, there have been very few cases of anyone being held accountable for that. So there's the law, which says: you control this device, and what happens on this network is your responsibility. And then that contrasts with what enforcement looks like: is anyone actually getting put in jail or in trouble for this? The Turkey case was the first time that really happened in a meaningful way, to my knowledge. Yeah, so the question was: when you don't establish the full TCP connection, is the reason there that JavaScript can't do that? Yes, that's why something like
Encore can't. There have been a set of papers that came out around doing this: Spooky Scan, I believe, and Augur are the tools for that. What it's doing is saying: I don't control computers A or B, but I want to test whether they can form a TCP connection between them. It uses the TCP SYN backlog, the queue of half-open connections: it fills up the machine's SYN backlog directly from itself, then spoofs, at the IP layer, a packet that appears to come from that machine to another machine. If a reset comes back, it flushes part of the SYN backlog, and then you can probe the machine again and see whether the backlog got flushed out or not. So there's a set of weird side channels in how TCP works that let you learn this. While there was an attempted connection, a reset, or a spoofed packet, these aren't going to look like full connections, but they're able to help. The other thing people do is say: okay, here's an end host; I'm going to traceroute to it, go two hops back, assume that's a router, and only connect to things that are routers with other IPs behind them. I'll use that as a proxy for saying I'm not talking to end users, only to infrastructure machines. So if you send your traffic to those routers, and there's some DPI box looking for a bad HTTP header, maybe it's still going to break the connection, but you didn't target an end user; you were just sending it to some router that would probably have dropped it on the floor anyway. I can repeat the question, or either way. Yeah, so the question is whether I'm aware of the Decentralized Web Summit that happened a couple of weeks ago in SF. I didn't make it, unfortunately, but there was a bunch of cool stuff there. This is sort of the other side, which is: how do we add more resilience to the internet, how do we make censorship harder? There's a whole set of peer-to-peer technologies where you're trying to form more resilient connections. There's a whole lot of debate and discussion about what the right ways to do this are, and there are also risks. I know a bunch of different anecdotes on that; I am somewhat aware of it. The main things I would call out: there was this whole thing called Freenet back in the day, where everyone just caches content, content gets pushed around the network, you grab it from somewhere, and it's all content-addressed. That got people in trouble, because there was a bunch of bad shit that got put on there. The newer versions of trying to do something similar, like IPFS, are saying: we're not going to proactively cache stuff or push it around the network; we're only going to do it when someone actually requests it. So if I pull content because I want to watch a video, then my computer will be willing to rebroadcast it; but if I'm not associated with that content, I don't want my hands on it. So we've gotten more careful about that, as a way of trying to prevent harm to users who aren't affiliated with the content. The other side of this is things like WebTorrent, where you're using WebRTC and some of these video chat capabilities to push content around within people's web browsers, without users knowing that this additional layer of resilience is happening. Because WebRTC is so hard to work with, nothing much has come out of that so far: we've got limited video stuff, but we haven't really got a full decentralized web yet that's fast and that people would use without knowing they're using it. But we can hope that'll happen. Cool, I think I'm at 45:50, so it's about the end. Cool, thanks.
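The pull-based caching idea described above, only serving content you have yourself requested, identified by its hash, can be sketched as a toy in Python. The class and method names are hypothetical illustrations, not IPFS's actual API:

```python
import hashlib


class PullCache:
    """Toy model of pull-based, content-addressed caching: a node only
    rebroadcasts content it has itself fetched, never content pushed at it."""

    def __init__(self, network):
        # `network` stands in for remote peers: content_hash -> bytes.
        self.network = network
        self.store = {}

    @staticmethod
    def address(content: bytes) -> str:
        # Content addressing: the identifier IS the hash of the bytes.
        return hashlib.sha256(content).hexdigest()

    def fetch(self, content_hash: str) -> bytes:
        if content_hash in self.store:
            return self.store[content_hash]
        content = self.network[content_hash]
        # Verify the bytes actually match the requested address...
        assert self.address(content) == content_hash
        # ...and only now become willing to serve them to others.
        self.store[content_hash] = content
        return content

    def serves(self, content_hash: str) -> bool:
        return content_hash in self.store
```

The design point is in `fetch`: nothing enters `store` (and so nothing gets re-served) unless this node explicitly asked for it, which is the harm-reduction property the talk attributes to the pull-based approach.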