Here's the basic statement of the problem. You have a user, Alice. Alice and Bob are the traditional names for the users described in a cryptographic system. Alice is in a place like China, where the government controls and monitors all internet traffic in and out of the country. Alice is communicating with a machine run by Bob, which is somewhere outside China, like the US. The assumption is that China does not know about Bob's machine, so they're not blocking communication between Alice and Bob. She can send Bob requests for the content of websites he has access to, and he can fetch that content and pass it back to her. But they have to use a communication method that cannot be automatically detected by the Chinese monitors watching internet traffic in and out of the country. Mallory, just a funny name, here a Chinese network administrator, but he's called Mallory, is administering the Great Firewall and monitoring all traffic in and out of China. He can, what did I say, have I got like a booger hanging out of my nose or something? Why is everybody laughing? Yeah, right, that's an old expression. I did that on purpose. So Mallory blocks access to sites that Alice is not supposed to be allowed to communicate with, like cnn.com, but the default policy is to allow sites that have not yet been classified, so Alice is allowed to communicate with Bob. The protocol that Alice uses to communicate with Bob's website, though, cannot have any characteristics that can be detected easily. There are a lot of so-called solutions floating around that aren't really secure, like websites where you can type a URL into a form field, and the site will go and request the content of that page for you.
Those types of things are really insecure against monitoring, because the censors can watch for network traffic indicating that somebody typed something beginning with http into a form field, and they can also see the traffic coming back the other way and see what you were looking at. So it's a temporary solution at best, not the kind of thing you would rely on in a situation like Alice's. There are two ways of looking at this problem: one is called the arms race strategy, the other the getting-it-right-the-first-time strategy. Arms race means we design a protocol for Alice to communicate with Bob, knowing that Mallory could figure out a way to detect and block that protocol, and we say, okay, we'll fix that in the next revision, because it will take them a few product cycles to figure out how to detect the protocol we're using, and by the time they do, we'll have another one. That's the first strategy: we publish a protocol that has known weaknesses, the people who make the monitoring software create software that detects those weaknesses, we create a new version, and so on. The second strategy is getting it right the first time. That doesn't literally mean getting it right the first time; you never get it right the first time, but you should at least attempt to. You should never distribute a protocol that has known weaknesses and say, well, it will take them a while to catch up, and by that time we'll have a new version without those weaknesses. I'm going to argue for why the arms race strategy, constantly coming up with new versions of the protocol, is a bad idea.
First of all, suppose we publish version 0.0 of some protocol, and lots of different Bob sites spring up all over the internet using 0.0, and then we learn it's got a flaw in it. Okay, two months from now we'll have version 0.1, and then half of the sites upgrade to 0.1, and so on. The Alices inside China using the software have to communicate with different websites run by Bobs, and version negotiation is not easy in this scenario, because, remember, all the communication between Alice and Bob has to look completely unsuspicious to somebody monitoring the traffic. So it's not possible to have Alice send out a header saying "I want to open a session using version 0.0 or version 0.1 of this protocol." Any situation where you end up with lots of different clients and servers using different versions of the software is a really bad idea. Also, if we follow a plan that requires people to keep updating the protocol and making fixes to it, somebody is going to have to keep doing that work, and I don't want to spend three months on something that's going to be broken after a week. Do you want to do it? Yeah, any volunteers? I don't want to. That's why, if we're going to invest this amount of effort in designing a secure protocol, the get-it-right-the-first-time strategy is better. Also, if you design and deploy a protocol that can be detected easily, then after it's been in use for a few months, the Chinese network monitors will have detected all the servers the Bobs are running, because, as we said, if the protocol has a known weakness that gives the censors and monitors a chance to detect everyone running it, they will know that those sites are renegades and add them to their block list.
And so even if version 0.1 then comes out and is secure and cannot be detected easily, it will be too late, because the censors and monitors will have blocked all the people running these circumventor sites. Also, of course, from the user's point of view, if Alice inside China is using version 0.0, it doesn't do her any good when version 0.1 comes out if she has already been arrested for using the old one, which was deployed even though it had known weaknesses in it. This seems like a given from a security point of view, but I've been running a mailing list for a while discussing how to come up with this type of protocol, and I hear a lot of people edging towards saying, okay, let's release it this way; it will be inconvenient for the censorware companies to come up with a fix against this, so it will buy us some time, and by the time they fix it, we'll have the next version. I was just explaining why I think that's a bad idea. But people keep proposing things like: let's have a protocol where the browser passes cookies out to the server, and inside the cookie is data representing the URL that you want to get. That is a short-term solution that's going to be blocked very quickly. It will either result in the censoring network administrators blocking all cookies by default and only allowing them for sites like Hotmail that really need them, or they'll be able to monitor the cookie stream and tell what you were looking at anyway. Or something like: use JavaScript to encrypt URLs submitted through forms. I already explained why it's a bad idea to type http://cnn.com into a form and submit it, because the censoring network administrators can spot something like that; the idea here is that a standard piece of JavaScript on all these pages, on the client end, could be used to encrypt the URL before submitting it.
The problem is that as soon as you do that, the censorware makers will get a copy of that JavaScript code and have their censoring proxy servers monitor incoming pages for that snippet of code. They will know that those pages are hostile sites used to circumvent their network, and they'll block them. ROT13ing the page contents is another obvious example. ROT13 is where you replace every letter on the page with the letter 13 places away from it in the alphabet, wrapping around at 26. Obviously, if anybody invented a protocol based on that, the censorware makers would design their proxy to scan for pages that had been ROT13ed. And this is stuff that people have actually proposed over the past couple of weeks on the discussion list for this problem, presumably because they're missing the point about why the arms race syndrome is a bad idea: they say, okay, we'll publish it this way, and of course the censors can come up with a way to detect it later, but that will buy us some time, and then we'll come out with a new version. It's a very bad idea to get caught up in thinking that way, and it's why I prefer the strategy called getting it right the first time. So the core assumption for the get-it-right-the-first-time strategy is that you want to design a protocol that can never be detected when it's in use. You want to design it such that even if the Chinese government knows the details of your protocol, even if you publish all the source code for the client on Alice's end and the server on Bob's end so they can examine every detail, they still cannot write software to detect when it is being used. That way, Alice can use it to communicate with Bob through the firewall, and Bob can decode the requests sent from Alice, request the websites, and pass them back to her. There has been a lot of discussion about how important it is for the requests going from Alice's machine to Bob's machine to be encrypted.
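The ROT13 idea mentioned above is easy to sketch, and just as easy for a censor to defeat: applying the same transform twice recovers the original text, so a censoring proxy can simply un-ROT13 every page and compare it against cached copies of banned sites. A minimal illustration:

```python
# ROT13: rotate each letter 13 places, wrapping at 26. Applying it twice
# is the identity, which is exactly why it offers no protection: the
# censor's proxy can un-ROT13 any page and compare the result against
# cached copies of banned sites.

def rot13(text: str) -> str:
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + 13) % 26 + ord("a")))
        elif "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") + 13) % 26 + ord("A")))
        else:
            out.append(ch)  # digits, punctuation, spaces pass through unchanged
    return "".join(out)

scrambled = rot13("Banned news from cnn.com")
# The censor simply applies the same function to undo it:
recovered = rot13(scrambled)
```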
There's a key point here: encryption is not as important as steganography, because we're designing a system to be used in the kind of situation where just the fact that you're circumventing the system at all is likely to get you in trouble. If they can figure out that you've circumvented the system, they're not as concerned with what you were actually looking at; simply the fact that you got around it is enough to get you in trouble. So we're talking not about places like the US, where there might be limited monitoring of citizens, but places where the regime is so strict that you can be punished just for attempting to break the rules: places like China, or Iraq, or a high school where they're really cracking down on you. Another epiphany that came to us after a while was that it's just as valid to use secret-key encryption for Bob's site as public-key encryption, and we actually want secret-key encryption instead, because, for reasons explained later, you want the number of bits transmitted back and forth to be very small. If you use public-key encryption securely, you already run into a problem where you have to transmit on the order of 1,024 bits back and forth. The thing about public-key encryption is that it's appropriate for a site where everybody, including the adversary monitoring traffic, knows about the site you're communicating with; the site has published a public key, the adversary has access to that public key, but they still can't decode the transmissions. For Bob's website, you're assuming that the adversary, Mallory, the Chinese government monitoring you, doesn't know about the site in the first place. Because the communication endpoint is secret, you don't lose anything by making the encryption key secret as well. So whenever Bob distributes the location of his circumventor website to people, he just distributes the secret encryption key along with it.
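The secret-key point can be sketched as follows. This is an illustration only, using a toy SHA-256-based stream cipher as a stand-in for whatever real symmetric cipher the protocol would use (it is not a vetted cipher and not the actual protocol); the idea is just that Bob hands out the key together with his address, and the ciphertext can be exactly as short as the plaintext:

```python
import hashlib

# Illustration only: a toy stream cipher built from SHA-256, standing in
# for whatever real secret-key cipher the protocol would use. The point
# is that since Bob's address is already a shared secret, shipping a
# shared secret key along with it costs nothing, and symmetric
# ciphertexts can be far shorter than public-key ones.

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    ks = keystream(key, nonce, len(plaintext))
    return bytes(p ^ k for p, k in zip(plaintext, ks))

decrypt = encrypt  # XOR stream cipher: the same operation both ways

# Bob distributes (address, secret_key) together, out of band.
secret_key = b"shared-out-of-band-with-bobs-url"
url = b"http://cnn.com/"
ct = encrypt(secret_key, b"nonce001", url)
# Note the ciphertext is exactly as long as the URL: no 1,024-bit
# public-key blob to transmit.
```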
And like I said, for the get-it-right-the-first-time strategy, you're not trying to design a protocol that is secure merely against whatever monitoring, censoring proxy servers are currently implemented. You want it to be secure against current and future censoring proxies, which means assuming the network administrator is using what you imagine to be the perfect program, with none of the shortcomings of the ones that exist now. That perfect program could, for example, track users on a per-session basis: if somebody dials up on two different occasions, it can compare the traffic from their first dial-in session with their second, and if it notices anything unusual about the second session, it can detect a deviation from their normal habits. That's only possible, of course, if you track every user individually. Certain features like HTTPS would seem like a really easy way to solve this whole problem: just have Bob's site use HTTPS for Alice to communicate with it. But I believe that if we deployed a protocol based on something like HTTPS, which is really easy to detect, what would happen is that the makers of the censoring, monitoring proxies would implement a policy where HTTPS is blocked by default and then allowed back only for specific sites like Amazon.com that have a legitimate use for it, sites the censors know are not running a circumventor server on the back end. I actually checked, just anecdotally, my browser history to see how often I use HTTPS, and it was only about five sites in the previous month. So it would actually not have a huge impact on what these censors consider legitimate use if they implemented a policy where HTTPS is blocked by default. The same goes for things like cookies.
If you design a protocol that relies on cookies, they're probably going to do the same kind of thing: implement a policy where cookies get blocked by default, and then add them back for sites like Hotmail.com that have a legitimate need and can't work without them. Also, some of the checks this ideal, perfect censoring software would need to conduct, described later, would be too processor-intensive to run for all users. For example, if detection relied on parsing the JavaScript on the pages a user was looking at, that's obviously not feasible to do for the entire population of China, or even the entire online population of China, which is a lot smaller. They don't have the resources to parse that for everybody, but they could do random audits, where at any given moment 1% of all internet users going through their system are subject to a more processor-intensive audit, in which the censors look closely at each page and determine whether it contains, say, malicious JavaScript code. Also, remember that computing power doubles every 18 months and the online population does not grow that fast. So if you want this to be secure into the near future, you have to assume the censoring proxies can devote exponentially more computing power to monitoring each individual user every year. Both of these last points are basically a response to frequently made objections of the form "there's no way the censor could do such-and-such, because they don't have the computing power to monitor everybody to that extent." That might be true now, but they can solve the problem both by waiting for computers to get faster and by monitoring only a subset of their users in each time period.
Another thing that comes up pretty frequently on the list where this problem is discussed: it is not fair to assume that these censors have to play by the rules. For example, if they start blocking cookies by default and only allowing them back for certain websites, it's not fair to say, well, they'd never do that because it would violate RFC something-or-other. You can't whine that the censor isn't playing fair by implementing a policy that blocks a lot of legitimate sites. I see people doing this in these discussions: they say the censors will never block HTTPS because so many sites require it. But the fact is that the software implemented to censor pornography and controversial political content already blocks so many legitimate sites that it would not be much additional impact for them to block HTTPS by default. The only rule we assume is that Mallory's censoring proxy server has to remain useful to what they consider legitimate users. That means websites have to be allowed by default, with the censors blocking only the sites they know to be malicious. In order for Alice to be able to communicate with Bob's website in the first place, the censors have to have what's called a blacklist policy, which means all sites are allowed by default and only sites on the blacklist get blocked. That's the opposite of a whitelist policy, where all sites are blocked by default and only sites on the whitelist get allowed. There are some programs that use whitelists, but they make the internet nearly useless, because there's no way the manufacturer can classify everything. So the two theoretical problems are how to communicate data from Alice to Bob's machine, namely the URL that she wants to look at, and the reverse direction.
In the reverse direction, the data going from Bob's machine to Alice has to contain the page contents that she wants to view. This is just underlining the obvious, but these two problems have some theoretical differences. First, when Alice communicates her URL request to Bob's machine, the request has to arrive exactly as it was originally stated: get even one letter of a URL wrong and the garbled data is useless; you'll usually not get the right page. That's not true of getting the page from Bob's machine back to Alice: you can scramble it a little and it will still be useful to the recipient. Also, in stating the problem, you have to decide whether you want Alice's machine just to send a URL request over to Bob, or whether you also want to support other HTTP header information, like cookies. What if Alice wants to fill out a POST form? Do you want the protocol to support that kind of information? All of that makes it more difficult. Again, it's different in the other direction: from a theoretical point of view, the communication going from Bob's machine back to Alice's machine does not have to arrive exactly as it was sent. It's acceptable if Bob's machine serves the page back in a way that might not be recognizable to the censoring proxies; you're allowed to scramble the data, replace words, and do things like that, in such a way that the monitors won't detect it. So now, assuming you want to support a way for Alice to send a request to Bob and for Bob to send the page contents back to Alice, there's a fork in the road: two broad strategies you can take.
Either you assume that the only piece of software Alice needs is the browser, and you use built-in features of the browser, like cookies, HTTPS support, and JavaScript, to provide Alice with a way to communicate with Bob. Obviously that would be preferable, because it's a lot easier: you don't have to distribute software to users, and if there's an upgrade on the server end, the client end absorbs that upgrade immediately, because the client is just the browser. The hard way is to distribute software to Alice, so that Alice's browser talks to that software, the software sends the request out to Bob's machine, and Bob's machine decodes the request and sends data back through the same path. What we've discovered through this discussion is that it's really not possible to come up with a satisfactory solution the easy way, using just the browser on the client end, on Alice's machine. Both of the previously stated problems fall apart: there's no way to get the URL request from Alice's machine to Bob without it being detected, and there's no way to get the page contents from Bob back to Alice without the snooper either seeing what you were looking at, or at least detecting that you were using a circumvention system, which would place Alice in danger. Most of what follows is a summary of discussions that went around on the list about why ideas that seem simple at first do not work. There are a lot of existing sites, like Anonymizer, where you can type in a URL, hit Go, and it will fetch the page for you.
The problem with things like this is that it's easy to monitor GET and POST data submitted by a browser through forms, so the censor can see if somebody typed in "http://something". It would be easy for the Chinese government to monitor for any site that looks like that and assume it's a way of circumventing their firewall. Even if you designed it so the http at the beginning wasn't necessary, it's still really easy to spot somebody typing a URL into a field on a form. A lot of this seems trivially obvious, but it's the result of going around in circles for a long time discussing how to solve this problem. So when this was pointed out, somebody next said: instead of having a single form field where you type in the URL, let's have multiple fields next to each other and split the URL across them. This is a good example of why steganography is a lot harder than cryptography. It's not that it wouldn't work; it's that you're creating a situation rare enough that it would stand out to somebody monitoring your communication. What's rare about it is that most forms contain only one field where you can type in free text. There are lots of exceptions, but that's the general rule, and forms where you're asked to split data across several text fields would stand out to somebody monitoring for that type of communication. Also, of course, once you've published your protocol for how to split the data across form fields, it's easy for the censor to reverse it and recover what you typed in. So that idea was batted around for a while. It was inevitable that somebody would suggest using HTTPS for the communication both ways. Again, the problem is not that this would not work, but that it's easy to detect.
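The split-across-fields idea can be sketched in a few lines, which also shows why it fails: the split rule has to be published, so the censor can apply the same join to any multi-field form submission and recover the URL. Function names here are hypothetical:

```python
# Sketch of the "split the URL across several form fields" idea. The
# split/join logic is trivial; the problem is that once the protocol is
# published, the censor can apply the same join to any multi-field form
# submission and recover the URL, and multi-text-field forms are rare
# enough to stand out on their own.

def split_url(url: str, n_fields: int) -> list:
    size = -(-len(url) // n_fields)  # ceiling division: chunk size
    return [url[i:i + size] for i in range(0, len(url), size)]

def join_fields(fields: list) -> str:
    return "".join(fields)

fields = split_url("http://cnn.com/world/story.html", 4)
# The censor, knowing the published protocol, just does the same thing:
recovered = join_fields(fields)
```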
First of all, HTTPS is easy to detect because it runs on a different port number from HTTP, obviously, but even if you ran it on port 80, the censor could still see the kind of data people were submitting back and forth. It's so obviously encrypted that it would be trivial for somebody to detect when you are using HTTPS and block it by default. That would probably be the consequence of doing something like this: as I said, censors would start blocking HTTPS for all sites and only allowing it for sites like Amazon.com that actually have a legitimate use for it. The ideas we're progressing through here get cooler and cooler, but they still don't really solve the problem. Next, as I was saying earlier: you type a URL into a form, then click a button that executes some JavaScript code to scramble the URL's contents, and then the form is submitted. The person monitoring the traffic that goes over the wire never actually sees the URL you typed in, only the scrambled version of it. The problem is that you'd have to publish pages carrying the JavaScript code that does the scrambling, and what the censors could do, obviously, is monitor for any pages containing that specific code; when they see it, they know what you're doing and they block the page. This inspired the next suggestion, which is also really cool and still doesn't solve the problem: polymorphic JavaScript. Polymorphism refers to a technique virus writers use to make viruses unrecognizable to the software that scans for them. The idea is that the specific instructions, the shape or form of the code, change, but the logical effect of the code stays the same: it modifies itself in ways that don't affect its output, so it still works the same way.
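The censor's counter to the published-JavaScript idea is nothing more than a substring scan. A minimal sketch (the snippet string is a made-up stand-in for whatever published scrambling code the circumventor pages would carry):

```python
# Sketch of the censor's counter to the "JavaScript scrambles the URL"
# idea: every circumventor page has to carry the same published script,
# so the proxy just scans incoming pages for that snippet and blocks
# the site serving it. The snippet below is a hypothetical stand-in.

KNOWN_SNIPPET = "function scrambleUrl(u)"  # fingerprint of the published code

def page_is_suspicious(html: str) -> bool:
    return KNOWN_SNIPPET in html

page = "<html><script>function scrambleUrl(u){ /* ... */ }</script></html>"
innocent = "<html><p>cat pictures</p></html>"
```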
And the problem with this is that writing polymorphic code is really difficult, and even if you change a few variable names and things like that, it's still possible for the censoring proxy to recognize the shape of your JavaScript code when it sees it on a page. If they know exactly what to look for, and they know the algorithm you're using to generate the code, it is actually not that difficult to write a simple program that will recognize pages generated by that code, even if the code looks a little bit different every time. Also, as I mentioned earlier, even if the censors don't have the processing power to run certain kinds of audits on every page that every user looks at, they can focus very processor-intensive monitoring on a portion of their user base; they could parse the JavaScript code on a page viewed by a particular user and determine whether that code is a polymorphic variant of the JavaScript used to circumvent their system. So, another idea gone. Those were all weaknesses in the problem of transmitting the URL from Alice's machine to Bob. The other problem is how you get the contents of the page Alice wants to view from Bob's machine back to Alice. Again, there was the HTTPS idea, and I already explained what would happen if people used that: it would start getting blocked by default; it's just too easy to detect. Somebody also suggested that when Bob's machine communicates back to Alice, it could send back a page structured so that JavaScript code on the page writes out the content with document.write calls, a JavaScript function that writes out the entire contents of the page Alice wants to view.
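The claim that naive polymorphism is detectable can be illustrated with a toy normalizer: strip out the parts that vary (identifier names, whitespace) and fingerprint what's left, so two variants that differ only in variable names collapse to the same shape. This is a sketch under that assumption, not any real censorware's detector, and the JavaScript fragments are hypothetical:

```python
import re

# Sketch of how a censor could defeat naive polymorphism: collapse the
# parts that vary between variants (identifier names, whitespace) and
# compare what's left. Two "polymorphic" variants that differ only in
# variable names normalize to the same string.

def normalize(js: str) -> str:
    keywords = {"function", "return", "var"}  # tiny illustrative keyword set
    js = re.sub(r"//[^\n]*", "", js)          # drop line comments
    js = re.sub(r"\b[A-Za-z_]\w*\b",
                lambda m: m.group(0) if m.group(0) in keywords else "ID",
                js)                            # collapse all identifiers
    return re.sub(r"\s+", "", js)              # drop whitespace

variant_a = "function scramble(u){var k=7;return u+k;}"
variant_b = "function s(x){var q=7;return x+q;}"
# Different surface forms, identical normalized shape.
```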
That means the actual literal text of the page would not be visible to the censoring proxy server monitoring the traffic, but you've still got the problem that pages consisting only of JavaScript that writes out the entire page are also very easy to detect. This was another idea where I got the feeling people were not really trying to attack their own ideas very carefully before suggesting them, because for a lot of this stuff it only took a moment's reflection to see why it would be too easy to detect and why it would not work. I think people get into the habit of thinking about problems from a cryptographic point of view, where you only have to be secure against people breaking the code; problems where you have to be secure against people detecting that you're using a protocol at all are different. The next idea that came up: how about sending the page back to the user scrambled in a certain way, say with the letter s replaced by a dollar sign? This would be useful if the censoring proxy, not being able to detect Alice communicating with Bob directly, instead looked at the contents of the pages Alice is viewing. Say the proxy blocks cnn.com, which they actually do in China. The censoring proxy could keep a cached copy of cnn.com, refreshed every day, and then every time Alice views a page, compare the contents of that page against cnn.com; if they're identical, the censors know Alice is using some service to circumvent the Chinese firewall and view banned content. So the idea somebody put forward to guard against that was to scramble the page in such a way that it is not recognizable to the censor.
The problem is that if you scramble the page in a way that lets a person still recognize the text, the censoring proxy can perform the reverse of the process and determine what the original page was. Anything that's just a scramble, like replacing s's with dollar signs, the censoring proxy can undo and compare against the known plaintext of the banned site. It doesn't help to add randomness to the process, say such that only half of the s's get replaced with dollar signs so the censoring proxy doesn't have an exact page to monitor for; it can still easily match the scrambled version of the page against the original. And you can only mangle the text on a page so much before it becomes unreadable to a human being as well; if you start switching around letters and words, it quickly reaches the point where humans can't recognize it either. In general, from an AI point of view, the problem is that if a human reading the page can still recognize the original text and its meaning, the proxy server can still take a copy of the page and match it against the known plaintext of the blocked page. Somebody else came up with another idea: why not send the page back to the user by rendering the text into a very large image and sending that back? Well, okay, so is OCR really something to worry about? The censors obviously don't have time to perform OCR on every image of every page viewed by every user in China, but again, remember that computing power is increasing faster than the online population, which means they'll have exponentially more computing power to devote to auditing each user. More importantly, they can select a subset of users they want to monitor, and anything that is computationally feasible, as OCR is, could be done if they really cared enough.
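The point about randomized letter substitution can be made concrete with a fuzzy match. A sketch of the censor's side, using Python's standard difflib for similarity (the cached page text and the 0.8 threshold are illustrative assumptions, not from any real product):

```python
import difflib

# Sketch of the censor's counter to letter-substitution scrambling: undo
# the published substitution and fuzzy-match against a cached copy of
# the banned page. Randomly replacing only some of the s's doesn't help,
# because a similarity ratio tolerates that noise anyway.

CACHED_BANNED_PAGE = "senators discuss sanctions against the regime"

def descramble(text: str) -> str:
    return text.replace("$", "s")  # reverse of the published scramble

def looks_like_banned(page: str, threshold: float = 0.8) -> bool:
    ratio = difflib.SequenceMatcher(
        None, descramble(page), CACHED_BANNED_PAGE).ratio()
    return ratio >= threshold

# Even with only some s's replaced, the match is still obvious:
scrambled = "$enators discu$s sanction$ against the regime"
```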
They'll solve the problem, and any time you have human beings devoting resources to monitoring the pages viewed by specific users, this obviously would not be secure against that: a human can look at the image and tell what you were doing. And of course, using large images like that would be a lot slower from the user's point of view, and the gain would not be big enough to make up for the performance hit it would cause. So that segment was all explaining why it does not work to rely on built-in features of the browser to decode the communication being sent back to Alice from Bob's machine. What you need instead is a program running on Alice's machine that can do the decoding of the data being sent back from Bob. That would be like a miniature proxy server that Alice runs on her computer, and she sets her browser to use that miniature proxy as its proxy server. The miniature proxy implements the protocol we're trying to design here, communicating with Bob's machine outside of China and sending data back and forth in such a way that somebody monitoring the data stream can't tell anything suspicious is going on. Again, the two separate problems are how to get the URL request from Alice to Bob, and how to get the page contents from Bob back to Alice. The protocol we're working on designing takes care of both of these problems, in such a way that, depending on the level of security you want, in the first version a software program monitoring the communication would not be able to spot anything, and if you want to be really, really secure, you can even arrange it so that a human being monitoring the communication stream would not notice anything suspicious. The main problem is that URL requests going from Alice to Bob have to be disguised as the kind of web-surfing traffic a normal user would engage in.
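The job of Alice's miniature proxy can be sketched with the networking left out: take the browser's real request, encrypt it under the shared secret key, and dress it up as an innocuous-looking request to Bob, with Bob's server running the inverse. Everything here, the hostname, the URL framing, and the toy XOR cipher, is a hypothetical stand-in, not the actual protocol:

```python
import base64
import hashlib

# Sketch of what Alice's local mini-proxy does, minus the networking:
# encrypt the real request under the shared secret key and frame it as
# an innocuous-looking URL on Bob's server; Bob runs the inverse.
# The cipher, hostname, and framing are illustrative stand-ins.

def xor_stream(key: bytes, data: bytes) -> bytes:
    ks = b""
    counter = 0
    while len(ks) < len(data):
        ks += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, ks))

def alice_encode(secret_key: bytes, real_url: str) -> str:
    blob = xor_stream(secret_key, real_url.encode())
    token = base64.urlsafe_b64encode(blob).decode().rstrip("=")
    # Hypothetical innocuous-looking path on Bob's server:
    return "http://bobs-innocuous-site.example/photos/" + token

def bob_decode(secret_key: bytes, request_url: str) -> str:
    token = request_url.rsplit("/", 1)[1]
    token += "=" * (-len(token) % 4)  # restore stripped base64 padding
    blob = base64.urlsafe_b64decode(token)
    return xor_stream(secret_key, blob).decode()

key = b"shared secret"
wire = alice_encode(key, "http://cnn.com/")
```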
The three main types of things a web surfer does are: typing in URLs that are not linked from a page previously viewed, clicking on a link from the page currently being viewed, or filling out a form to submit data. The next step is explaining, in terms of those three activities, why it does not work to use long, garbled URLs to transmit data from Alice to Bob, which is a very common proposed solution to the problem. The argument goes: look, if you look through your server log files, you'll see that people access URLs all the time that have a hundred characters or more in them, so let's just use some of these long, random-looking URLs to transmit data from Alice's machine to Bob, and nobody will ever notice the difference. Then Bob's machine, which is using the same protocol, can receive these URL requests and decode them to get the original data back. The reason that doesn't work is that during normal web surfing, a user does not generally access those long, random-looking URLs unless those URLs were linked from a page that the same user was viewing immediately beforehand, and we're assuming the censoring proxy can track web-surfing traffic on a per-user basis. That means if you're looking at a page that has a bunch of hundred-character URLs on it, and you click on one of those links and go to that next page, that does not look suspicious to a censor monitoring the traffic. But if the next URL you access is a hundred-character URL that is not linked from the page you're currently viewing, that gets flagged as suspicious by the people monitoring the traffic. They won't necessarily block that URL, because it can happen innocently: somebody could be typing in something from their bookmarks list, or a URL that somebody emailed them.
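The flawed proposal itself is easy to sketch (hostname and path scheme hypothetical). The encoding round-trips fine; the problem is purely that the resulting URL is linked from nothing Alice was viewing.

```python
import base64

def encode_request(url, host="bob.example.com"):
    """Disguise a banned URL as a long, random-looking URL on Bob's server."""
    token = base64.urlsafe_b64encode(url.encode()).decode().rstrip("=")
    return f"http://{host}/a/{token}"

def decode_request(disguised):
    """Bob's side: recover the original URL from the disguised one."""
    token = disguised.rsplit("/", 1)[1]
    pad = "=" * (-len(token) % 4)           # restore stripped base64 padding
    return base64.urlsafe_b64decode(token + pad).decode()
```

Any such long URL arriving out of nowhere, though, is exactly what per-user traffic tracking is watching for.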
But it would be flagged as suspicious, and if it happens over and over again, then you've got a profile on a particular user who you think might be trying to circumvent your system. So to guard against this, what the censoring proxy would end up doing is maintaining, on a per-user basis, a list of links that the user is currently allowed to access without looking suspicious. That would include links from the page the user is currently viewing; links created on the page by JavaScript code executing (a lot of links are dynamically created every time you load a page); and URLs that could be accessed as a result of filling out a form on the page. What happens when you fill out a GET form is that the data you enter in the fields gets moved into the URL. So the software monitoring your web traffic would scan for forms on the pages you load, and any URL that could be constructed out of such a form would be okay, but any other long URL not in that format would be considered suspicious-looking traffic. So this is a real bottleneck in how much information you can submit through a form. If a form has a hidden field in it, then when the user submits the form, the value of the hidden field should be the same as whatever was set by the website. I'm assuming a little bit of knowledge of HTML and forms here, which I don't really have time to explain — yeah, speaking of stuff that CDC doesn't know how to do, what's your URL here? Okay, that didn't quite work. But if you know how to design a form on a web page: if the site sets a value for a hidden field in the form, then when the user submits the form, the value of that hidden field has to be the same as what was set for them. Otherwise they give themselves away as doing something out of the ordinary.
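The censor-side bookkeeping just described might look roughly like this (the class and the length threshold are hypothetical):

```python
class UserTracker:
    """Hypothetical per-user state kept by the censoring proxy: the set of
    URLs this user could plausibly visit next without looking suspicious."""

    LONG_URL = 100  # assumed threshold for a "long" URL

    def __init__(self):
        self.allowed_next = set()

    def saw_page(self, urls):
        # Links on the page, script-generated links, and URLs that could
        # be built by submitting the page's GET forms.
        self.allowed_next = set(urls)

    def is_suspicious(self, url):
        # A long URL is fine if the page being viewed links to it;
        # the same URL typed out of nowhere gets flagged.
        return len(url) >= self.LONG_URL and url not in self.allowed_next

tracker = UserTracker()
long_url = "http://example.com/" + "x" * 100
tracker.saw_page([long_url])
clicked_ok = not tracker.is_suspicious(long_url)   # linked from current page
tracker.saw_page([])                               # new page, no such links
typed_flagged = tracker.is_suspicious(long_url)    # arrived out of nowhere
```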
If you have a drop-down list on the form with eight choices, then the user can only submit three bits of information by making one of those eight choices. So it's not an efficient way for Alice to communicate with Bob, submitting a form over and over and making choices on a drop-down list, if each choice communicates only three bits. A text field on a form gives you the opportunity to submit more bits: a variable amount of information can go in a text field. The problem is that the censoring proxy could adopt a policy where submitting too much text in a form is considered suspicious traffic that could possibly be used to circumvent the system. They could implement a policy where any form submission with more than a certain number of characters gets blocked by default; websites which need that functionality to work, like hotmail.com, would be allowed to accept long submissions, but at any other site you would not be allowed to submit more than a certain amount of information through a form, because otherwise that would be considered traffic that might be used to get around the proxy. So the solution for getting information from Alice to Bob is that Alice's machine has this miniature proxy server installed. Alice types a request into her web browser for a site the Chinese government bans. The web browser talks to the software installed on Alice's machine, saying the user wants to format a request for www.cnn.com. Then the miniature proxy server on Alice's machine takes the data in the request, www.cnn.com, and converts it into some sort of text query that could be entered in a form on a web page without looking too suspicious.
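The arithmetic behind those limits is worth sketching. The bits-per-character figure below is a made-up assumption for illustration, not a property of any real protocol.

```python
import math

def dropdown_bits(n_choices):
    """Bits conveyed by one selection from an n-way drop-down list."""
    return math.log2(n_choices)

def queries_needed(payload_bytes, chars_per_query, bits_per_char=4):
    """Form submissions needed to move a payload through a capped text
    field, assuming (hypothetically) about 4 usable payload bits per
    character once the data is dressed up as innocent-looking text."""
    return math.ceil(payload_bytes * 8 / (chars_per_query * bits_per_char))
```

An eight-choice drop-down carries log2(8) = 3 bits per submission; under these assumptions, a 60-byte request pushed through a censor-capped 40-character text field would take three separate form submissions.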
That text query goes out over the wire to Bob's machine somewhere in the States, and the only thing the people monitoring the traffic would see is a text query going from Alice to Bob. They obviously would not know anything about what that query represents, because they wouldn't have Bob's secret encryption key, so they wouldn't know the URL Alice is requesting. More importantly, not only would they not know what Alice is requesting, they wouldn't know she was requesting anything at all; it would look like a completely innocent communication. The problem is, if the censoring proxy is imposing a limit on how much data you can submit through a form, what happens when you have a URL request that is too big to fit into one text query, so you have to break it up into multiple submissions on the same page? That is something you want to avoid, because if you did it, you would have to pause for a few seconds between submissions. To anybody monitoring the traffic, you want to look like a real user, and a real user submitting text at a query page would not submit a bunch of queries with a 0.1-second pause between them; they would submit something, wait a few seconds, look at the results, then submit something again. So that creates a delay that's going to be a real problem if you have so much data that you have to break it up over multiple text queries in order not to get detected. One thing we still haven't explained is how to convert the URL request that Alice wants to send into a text query that can go out over the wire to Bob's machine without looking suspicious to somebody monitoring the traffic. You can use an algorithm that maps bits to letters or something like that.
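A minimal sketch of that bits-to-text idea (the eight-word vocabulary is hypothetical): mapping bits straight to letters tends to produce gibberish, so the variant below indexes into a list of ordinary words instead, three bits per word.

```python
WORDS = [  # 8 words, so each word carries 3 bits
    "weather", "recipe", "music", "garden",
    "travel", "movie", "soccer", "history",
]

def bits_to_query(bits):
    """Consume 3 bits at a time; each triple selects one innocent word."""
    words = []
    for i in range(0, len(bits), 3):
        chunk = bits[i:i + 3].ljust(3, "0")   # pad the final partial triple
        words.append(WORDS[int(chunk, 2)])
    return " ".join(words)

def query_to_bits(query, n_bits):
    """Bob's side: invert the mapping and trim any padding bits."""
    idx = {w: i for i, w in enumerate(WORDS)}
    bits = "".join(format(idx[w], "03b") for w in query.split())
    return bits[:n_bits]
```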
The problem is, if your text query ends up consisting entirely of non-English words (or in the case of China, non-Chinese words, or whatever), that is also going to look suspicious to somebody monitoring the traffic. So you'll have to have a certain mix of real English words in the query just so that it doesn't set off any red flags. So what is meant by a query looking suspicious or not? If it's just a machine that the censors are using to monitor traffic, then there are certain combinations of words that would not look suspicious to a machine; but if there's an actual human being monitoring you, then anything that looks like random words jumbled together might be a tip-off to that human that you're using some kind of system to circumvent their policy. The second problem, again, is how to get the pages from Bob's machine back to Alice's machine in such a way that they look like real web pages, while also not resembling the original web pages Alice is actually requesting, because we assume the censoring proxy can compare the pages against the known text of a banned site in order to recognize it. And you should not rely on the censoring proxy passing along the pages served back from Bob's machine without modifying them. If your protocol relies on the assumption that the contents from Bob's machine will be returned exactly, then all the censors have to do is start adding an extra character at the end of every page or something, and that will no longer be a valid assumption. The same goes for cookies or HTTPS and things like that, which the censors could block very easily; you should not rely on them either. Here are a couple of things people came up with. One: use the least significant bits of the pixel data in an image to hide information.
Bob's machine would take the data it wants to send back to Alice, hide it inside an image, and send the image back to Alice as part of a web page; Alice's machine would receive the image, and her proxy server would decode the image data and serve Alice the page she wants to look at. The problem is that anything relying on noise in a communication stream has a weakness from a steganography point of view: the censor can introduce more noise into your stream. They don't want to make the service inconvenient for legitimate users, so they won't introduce too much noise into an image that does not already have static in it. But if they detect static in an image being served back, they can reason: either it's an image with real static in it, in which case we can introduce some more and it won't matter, or it's an image being used by somebody trying to circumvent our system, in which case we can put in some more noise and disrupt the communication stream. So that was another idea that occurred to everybody at about the same time, but it also has problems, which means we had to reject it. Here's something else, which relies on a different kind of noise in the stream. When Bob sends pages back to Alice, you could take the source code of the page and replace a character with an ampersand, followed by the hexadecimal code for that character, followed by a semicolon. That actually causes the page to be displayed the same way in the browser, but Alice and Bob could agree on some system like: a regular character is a zero, this funky encoded character is a one, and we can use that to send zeros and ones back and forth. Again, you have the problem that the censor can modify pages being sent back, and could just take everything of the form ampersand, hexadecimal code, semicolon and replace it with the actual corresponding character.
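The character-reference channel and the censor's one-line counter-move can both be sketched in a few lines (the encoding convention is hypothetical):

```python
import re

def embed_bits(text, bits):
    """A plain character encodes 0; the same character written as a
    numeric reference like &#x68; encodes 1."""
    out, i = [], 0
    for c in text:
        if i < len(bits) and c.isalpha():
            out.append(f"&#x{ord(c):x};" if bits[i] == "1" else c)
            i += 1
        else:
            out.append(c)
    return "".join(out)

def strip_entities(html):
    """The censor's counter-move: normalize all references back to
    literal characters, erasing the covert channel."""
    return re.sub(r"&#x([0-9a-fA-F]+);",
                  lambda m: chr(int(m.group(1), 16)), html)

page = "hello world"
stego = embed_bits(page, "1010")       # renders identically in a browser
cleaned = strip_entities(stego)        # but the censor can flatten it
```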
So anything that relies on that channel to communicate will be disrupted. Why are these two mentioned back to back? Because they are actually similar problems in a non-obvious way: they both rely on noise, meaning data that is normally insignificant. The noise in an image is something a user doesn't care about, and the encoding status of characters in a page's source code is also something a user doesn't care about. So in both cases, you can't rely on those channels to communicate information, because the censors can either introduce more noise or simply strip the noise out, like replacing the hexadecimal character references with the literal characters; either way the channel becomes unreliable. So the protocol for sending pages from Bob's machine back to Alice has to rely on non-noise, which means you have to hide the data in the words on the page. And we assume that because the censoring proxy does not want to disrupt service to legitimate users, it can't actually modify or replace words on a page in ways that might change its meaning, because then the service would be useless to all the people trying to use it for legitimate purposes. So you can use the text on the page to communicate data. A guy named Peter Wayner invented mimic functions, which take bit streams and convert them into text that looks statistically like English text, with a function on the other end that converts it back into the original bit stream. That would be the kind of thing you could use to communicate from Bob back to Alice. One general problem with all of these schemes is that if the censor notices that every time a user gets blocked from accessing a page, the next page they access is always the same one, they're going to realize that that page is a way of circumventing the system and that people go to it every time they get denied access somewhere.
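This isn't Wayner's actual grammar-based mimic functions, but a toy stand-in (the synonym pools are hypothetical) shows the invertible bits-to-plausible-text shape of the idea: each payload bit picks a word from one of two pools at that position.

```python
POOL = [
    ("the", "a"), ("quick", "small"), ("dog", "cat"),
    ("ran", "walked"), ("home", "away"),
]

def bits_to_text(bits):
    """Cycle through word positions; bit b selects POOL[pos][b]."""
    return " ".join(POOL[i % len(POOL)][int(b)] for i, b in enumerate(bits))

def text_to_bits(text):
    """Invert: which member of each pool was used gives the bit back."""
    words = text.split()
    return "".join(str(POOL[i % len(POOL)].index(w))
                   for i, w in enumerate(words))
```

A real mimic function matches the statistics of English much more carefully; this toy's rigid word cycling would be obvious to a human reader, which is exactly the limitation discussed later.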
There's no real way to solve this problem, because the user has to be trusted. The user who knows about Bob's circumventor site has to be trusted not only not to give it away, but not to do anything careless that would let the monitors figure out what it's being used for. The only real thing that can be done is to have Alice's client-end tool, the proxy server she's running, configured so that every time she's about to access a circumventor site, it warns her: keep in mind that if you always access this site right after being blocked from a different site, eventually that will give the game away, and the circumventor site you're using will get blocked as well. The last thing I'll mention: what if you want to do this securely in such a way that even if a human being is monitoring Alice's network traffic, they won't notice anything suspicious about what she's doing? We were talking earlier about certain kinds of text queries where random words jumbled together would look non-suspicious to a machine monitoring you, but would look suspicious to a human being and could set off red flags if a single user is singled out for monitoring. This means the text queries being sent from Alice's machine out to Bob now have to meet a higher threshold for what is considered non-suspicious: they have to look like the real kinds of queries that Alice would actually type. One way we thought of to solve this is for Alice to have a tool on her end which records the queries she types into real search engines during normal, legitimate web surfing. Then, when she gets blocked from a website and has to access Bob's circumventor site to get around it, she can draw on this stored list of queries that she entered at search engines during normal usage.
Each of those queries hashes to some specific value representing a few bits, so you pick out the queries Alice has entered in the past whose hash values correspond to the data she wants to send out now. One other thing to mention: anything Alice entered at a search engine that could possibly look suspicious should be excluded; you want to leave out queries containing terms that might raise red flags with the people monitoring her. That takes care of the problem of getting the data from Alice's machine to Bob's without anybody noticing it. The next step in the process is that the pages being sent back from Bob to Alice also have to look non-suspicious. I talked about Peter Wayner's mimic functions, how they take bit streams and turn them into text that looks like English text, with an inverse function turning the text back into a bit stream. And I said the text looks statistically like English, which means a computer cannot tell what you're doing; the problem is that the text generated by a mimic function is very easy for a human being reading the page to spot. So that would not be sufficient. Instead, you have a real problem here, because you have to generate something that looks like text that could have been written by a human, but you don't have a human being who can actually write the text for you. And you can't use a copy of some other page, or any text that has already been written, because then the censoring proxy could just monitor for sites that are always serving back copies of pages taken from somewhere else. That would leave too suspicious an audit trail. So you have the problem of how to create pages that would not look suspicious to the monitor.
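Going back to the outbound direction for a second, the stored-query scheme might be sketched like this (the hash truncation is a hypothetical choice): Alice keeps queries she really typed, and later sends whichever stored query hashes to the bits she needs to transmit next.

```python
import hashlib

def query_bits(query, n=3):
    """Top n bits of the query's SHA-256 digest, as a bit string."""
    digest = hashlib.sha256(query.encode()).digest()
    return format(digest[0] >> (8 - n), f"0{n}b")

def pick_query(stored, wanted_bits):
    """Choose a previously typed query whose hash encodes the payload bits."""
    for q in stored:
        if query_bits(q, len(wanted_bits)) == wanted_bits:
            return q
    return None  # not every bit pattern is covered by the stored pool

stored = ["cheap flights to seattle", "python tutorial",
          "weather in beijing", "chocolate cake recipe",
          "used bicycles", "movie times", "guitar chords",
          "how to tie a tie"]
```

Note the coverage problem visible in `pick_query`: if no stored query hashes to the wanted bits, Alice has nothing safe to send and must wait until her normal surfing fills the gap.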
One thing I came up with is that the pages returned by Bob's site come back in the format you get from a search engine, with summary information for each listing of a different page. The advantage is that it actually looks like the kind of page you would expect to be served back in response to the text query Alice sent out in the first place: usually when somebody enters a text query, they're doing it to get a list of pages with titles, summary information, links, and so on. The data being sent back from Bob's machine to Alice's would be hidden inside the page titles and the descriptions being served on the page, and Alice's software on her end, which understands the same protocol, would decode it in the same way. The only real weakness here is that a human being monitoring traffic in both directions could see the query Alice enters on Bob's site, see the search-result listings served from Bob back to Alice, and notice that the results do not really correspond to the text query Alice entered; they would conclude something suspicious is going on. The only real defense is that even if Bob's site gets blocked because a human monitored the traffic and detected something suspicious, Bob's site still looks like a real search site where you enter a text query and get results. So Alice has a sort of play-dumb defense: if she's in a situation where somebody confronts her with records of her traffic and says this Bob is running a known circumventor site for getting around our network, she can point out that the traffic looks like real traffic to a search site, and claim she didn't know what the site was for at the time.
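Structurally, the fake results page might be built like this. The markup and chunking below are hypothetical, and a real version would run the chunks through a mimic-style text encoder so the titles read like English rather than base64; this sketch only shows the carry-payload-in-titles shape.

```python
import base64
import re

def make_results_page(payload, n=4):
    """Bob's side: split the payload across n fake result 'titles'."""
    b64 = base64.b64encode(payload).decode()
    step = -(-len(b64) // n)                      # ceiling division
    chunks = [b64[i:i + step] for i in range(0, len(b64), step)]
    items = "".join(
        f"<h3>Result {i}: {c}</h3><p>summary text here</p>"
        for i, c in enumerate(chunks))
    return f"<html><body>{items}</body></html>"

def extract_payload(page):
    """Alice's local proxy: pull the chunks back out of the titles."""
    chunks = re.findall(r"<h3>Result \d+: ([^<]+)</h3>", page)
    return base64.b64decode("".join(chunks))
```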
To anybody who's interested in helping to work on this problem, what I've been talking about for the past 50 minutes: to summarize, the cycle we go through is that you come up with an idea for communicating between Alice's machine and Bob's server, and then you try to find some kind of pattern in the traffic that would give it away to somebody monitoring for the protocol. And as I mentioned earlier, there are two traps to fall into. One is the arms-race syndrome: if you spot a known flaw in the protocol, don't get into the habit of saying, well, this will delay them by a few product cycles, and by the time they have a fix for this we'll come out with another one. If it's possible to design it the first time around so that it's secure, you should do that instead of shipping a temporarily insecure protocol that will put a lot of users in danger. Also remember that the people administering the censoring network do not have to allow anything they don't want to. The only things you can really rely on are the URL requests going from Alice to Bob and the pages coming back from Bob to Alice, because any other side channels, like cookies, the censors will not hesitate to block if that's the only way to disrupt your protocol and it won't interfere with most normal websites. Lastly, I'm sorry I ran out of time before I could talk a little more about our website and how I got started on this. Our website is at peacefire.org, and we actually got started as a site with information about how to disable client-end blocking software on home computers, like SurfWatch — you know, it's really trivial from a script-kiddie hacker point of view, and it's not the kind of thing.
I mean, even script kiddies would be ashamed of it; that kind of information is just embarrassingly easy. But it is an issue we had looked at for a while, and we're actually perfectly up front about the fact that the information on how to circumvent blocking software is a gimmick to get people to come to the site and read the information we posted about internet censorship and why we're against it. Then, when it got to the problem of how to circumvent network-level blocking programs like the ones used in high schools, that was when some of the more interesting questions came up, namely how to do this securely, in such a way that the users cannot be detected. That led into the development of, basically, the math problem I've spent the last 50 minutes describing, which actually has pretty serious implications not just for people facing censorship from their local ISP or their school, but, if we can do this right, it would meet a great social need in countries like China, where it's a serious problem. I think what a lot of people don't realize is that things like PGP and Zero-Knowledge Systems and anonymizer.com, which are useful to people who live in countries where they're allowed, are not actually that useful to people who live in countries where just the fact that you're using encryption is enough to get you in trouble. So I hope that more work will end up being done on the subject of steganography in general, and on circumventing access to banned websites in particular. That's the last slide, so we'll move on to questions. I think I saw that hand in the back before anybody else, go ahead. The question was: how should Alice get the software in order to use it to communicate with Bob? The assumption is that she knows at least one person, Bob, who is on the outside, and she can obtain it from him.
Of course, the next question is: can the censors also monitor traffic between Alice and Bob, so that they could monitor for the exact bits being transmitted as Bob sends them to Alice? The assumption here is that the monitors are looking for general patterns of suspicious activity, not any one single instance of activity, like sending one PGP-encrypted message from a user in the United States to a user in China. Or put another way: if you want to circumvent the system once, that's possible to do. It's if you want to circumvent it repeatedly, as a matter of habit, that you need more security, to make sure that patterns of suspicious behavior aren't detected easily. Go ahead. Okay, the question was: could the protocol allow you to have a network of dynamically distributed Bob servers, outside the control of the censors, so that every time a person makes a query they go to a different one? That's actually something that was discussed a lot. The problem is that the model of a distributed network of nodes, where you can plug into any one of them every time you make a request, is more appropriate for things like Freenet and Gnutella. Those are systems that let you distribute files that are subject to censorship, and any time one node is censored, the data gets distributed to all the other nodes as well. The reason that model works for something like Freenet is that the overhead involved in censoring a particular node is fairly high: they have to get a court order and send the police in to shut something down. The reason that model doesn't really work for this problem is that the overhead of censoring a particular node is not high; all they have to do is add it to their list of blocked sites.
So if you are dynamically communicating with many different Bob servers, then if your protocol gives you away, each server will get detected and blocked as soon as you communicate with it. So the reason the distributed-network idea works for file-sharing systems is that the overhead of shutting down a particular node is high; it doesn't work for this because the effort involved in blocking one node is very small for the censors. Does that kind of explain it? Okay. Well, you get a shirt anyway. These are shirts that have the names and addresses of websites — I forgot to give a shirt to the person in the back as well — they have the names and addresses of websites that have been censored by different blocking software programs for all kinds of ridiculous reasons. Time magazine is on here because.