So, this talk is about a recent development of the Great Firewall of China, and it's joint work between a bunch of people, most notably Roya and David, who are both brilliant researchers fighting the good fight and coming up with techniques to circumvent internet censorship. Unfortunately they couldn't make it to Hamburg, so I'm the backup plan, I'm the one presenting our work, and I hope I'm going to do them justice. When I talk about the Great Firewall, I mean the Great Firewall of China. Academia has spent a lot of time researching this system over the last couple of years; I maintain an online bibliography where you can find a bunch of papers dating back many years. At this point we have a pretty good understanding of many aspects of this system. We know, for example, what it blocks: we have lists of keywords, lists of domains, and lists of URLs that are all blocked when you try to access the internet from within China. We also have an understanding of how it works. We know what it does: for example, it injects TCP reset segments into your TCP stream to terminate network connections, and it does the same for DNS. It's a little more challenging, but we also have some idea of where the system sits in the network. There was some back and forth over the years where we weren't completely sure if it's decentralized across the provinces in China or a central system in the backbone, but over the last couple of years we've gotten a better understanding of this. So we're doing pretty well. A big remaining issue is that it's difficult for us to do continuous measurements. All these studies, and ours is no exception, are one-off measurements: we do it once, we get the data, we publish it, but it could change tomorrow and nobody might notice.
My co-author Roya, for example, is working on changing that by proposing systems that can continuously measure censorship over time. Before I get to the meat, I want to give you a quick overview of the most common types of censorship in China. A lot of domains are blocked, and this is off-the-shelf DNS poisoning; there's really not that much special about it. Imagine a user sitting in China trying to connect to a web server outside the country. Your traffic is subject to deep packet inspection: this shady person here represents a deep packet inspection device. It looks at your traffic, and if it finds patterns that match censored traffic, what you get back is a bogus response. And I didn't make this IP address up: this is actually what I got when I tried to resolve facebook.com in China. For some reason it's an IP address in the United States, which doesn't make any sense. It's well known that the firewall uses a set of approximately 10 IP addresses it likes to return to misdirect users. We're not completely sure why, but this is the kind of DNS poisoning they're doing. Interestingly, as a client, if you wait long enough, just a bunch of milliseconds, you also get the original response from the DNS server. Which is funny, because if you're sitting in China trying to resolve a blocked domain, you actually get two responses. So this is basically a race condition: the firewall is trying to get its censored response to you faster than the genuine DNS resolver. And this works pretty well for the firewall. Not always: sometimes the original response reaches the client first, and then it doesn't work. But you get two responses, and the first one wins. Like I said, there isn't much special to this. And there is also keyword blocking.
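The race condition above can be sketched in a few lines: a naive stub resolver accepts whichever DNS answer arrives first, so an on-path injector that is closer than the real resolver almost always wins. The IP addresses and delays below are made up for illustration, not measurements.

```python
# Sketch: why the injected DNS answer usually wins the race.
# A stub resolver takes the first response matching its query, so whichever
# answer arrives first -- injected or genuine -- is the one it accepts.

def first_response_wins(responses):
    """responses: list of (arrival_ms, answer_ip, is_injected).
    Returns the answer a naive stub resolver would accept."""
    return min(responses, key=lambda r: r[0])

# The on-path injector answers from nearby; the real resolver is farther away.
# Both the IPs and the timings here are illustrative placeholders.
race = [
    (12, "192.0.2.1", True),    # injected bogus answer, arrives early
    (180, "192.0.2.2", False),  # genuine answer, arrives ~170 ms later
]
winner = first_response_wins(race)
print(winner[1], "injected" if winner[2] else "genuine")
```

A patient client that waits for both answers, as described above, would notice the contradiction; the stub resolver in your libc does not.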
If you have blocked keywords in your HTTP requests, it's the same story: a deep packet inspection device looks at your GET requests and sends you a TCP reset segment. It's the exact same race condition; they're just trying to get this reset segment to your browser before the actual web page arrives. You might think HTTPS fixes that for everyone, and that's kind of true, but it doesn't help you that much since you still rely on DNS. The DNS censorship is really the most effective mechanism there, since it simply precedes HTTPS; HTTPS doesn't help you much without working DNS. To work around that, the basic idea is to take your network traffic and wrap it in some form of cover protocol: something that tunnels your traffic and maybe even gives you fancy security properties like authenticity and confidentiality. There are a lot of tools for that, which wrap the traffic for you so it's harder for deep packet inspection boxes to filter. And this is really a problem for the firewall, since encrypted traffic is difficult to parse. At this point, deep packet inspection boxes might be left guessing: what is in this encrypted connection? Is it HTTPS, or perhaps VPN, or Tor? It's not entirely impossible to learn these things, since you can look at a bunch of signals, the most obvious one being port numbers. If it's port 443, it's probably HTTPS; if it's 993, it might be IMAPS. You can also look at the type of encryption. A lot of protocols use TLS, but there are other protocols out there, such as SSH, that use their own crypto and don't rely on TLS. And if you look only at the protocols that use TLS, you can still look at the specific setup of TLS, since many protocols use and configure TLS differently.
It's possible for censors to look very closely and identify these differences, which can then be used for a blocking decision. And if all else fails, you can still look at flow information: packet lengths leak information, timings, the direction of packets. All these things can be used, but all they give you is an educated guess. Often there just isn't enough information in there to really know that this is something you want to block, so there's some uncertainty left. And uncertainty is poison for censors, since it leads to collateral damage, meaning they end up blocking more than they really want to block. Who remembers this incident from a couple of years back, when the Great Firewall was messing with traffic to GitHub? Over a couple of days they gradually increased the censorship, and it looked like they were trying to see how far they could go: how much can we restrict access to GitHub before people start yelling at us? This went on for some days until at some point GitHub was completely unavailable. I think the reason was a circumvention tool that was hosted on GitHub. There was a public outcry; people didn't just tolerate it, and at some point they reverted the whole thing and unblocked it again. This is a good example that collateral damage really hurts, and it's useful for us to know because we can design our circumvention protocols to maximize this collateral damage. But the Great Firewall is designed in a smart way: they came up with a way to eliminate this uncertainty and gain certainty when all you have is an encrypted protocol. What they came up with is called active probing, or at least that's what we call it in our research paper. And it's actually really simple to understand.
So again, imagine a TLS connection, an opaque connection between a client in China and a server in Germany. In the first step, the Great Firewall looks closely at the TLS connection, in particular the handshake, since it contains information that could help it learn what's inside. In the case of Tor, it's looking for something called the cipher list of the TLS ClientHello. This is basically a string the client sends to the server to tell it which ciphers it supports, and it turns out this is more or less unique to Tor. The Tor project tries to make it look like a Firefox talking to an Apache, and that works very well for a while, until Firefox moves on and does something else. So it's an annoying game to play, and this is what the Great Firewall leverages. They look at this, and at this point they might think: hey, this could be a Tor connection. They're not completely sure, because maybe there's some unknown protocol out there that looks exactly the same. But they need certainty, because if they end up blocking it and it's some super important financial application that screws with their economy, they're not much better off. So in the next step they launch a short-lived probe that connects to the same server in Germany and tries to speak the Tor protocol. This is just a guess; they don't know if it will succeed. They just go ahead, try it, and see what happens. Worst case, the server has no idea what this garbage is and terminates the connection, but perhaps it answers with a Tor handshake. If that happens, the firewall has what it needs: it sent decoy traffic, tried to fool the server into replying, and when it did, it knows for sure that this is a Tor server. And once that happens, it can block it, just prevent access. This is quite neat from an engineering point of view.
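The two-stage decision described above can be sketched as a toy simulation. Everything here is a placeholder: the fingerprint string, the handshake strings, and the stub servers stand in for real DPI signatures and real network connections; only the control flow reflects what the talk describes.

```python
# Toy model of the two-stage design: DPI flags a flow as "maybe Tor",
# then an active probe confirms before anything is blocked.

SUSPICIOUS_CIPHER_LIST = "tor-like-cipher-list"   # hypothetical DPI fingerprint

def dpi_flags(client_hello):
    """Stage 1: cheap pattern match on the observed TLS ClientHello."""
    return client_hello == SUSPICIOUS_CIPHER_LIST

def probe(server):
    """Stage 2: speak the Tor handshake at the server and see if it answers.
    `server` is a callable standing in for a real network connection."""
    return server("tor handshake attempt") == "tor handshake reply"

def firewall_decision(client_hello, server):
    if not dpi_flags(client_hello):
        return "pass"        # no suspicion, no probe
    if probe(server):
        return "block"       # probe confirmed: this really is a Tor bridge
    return "pass"            # suspicious but unconfirmed: avoid collateral damage

# A stub Tor bridge that politely answers the probe:
tor_bridge = lambda data: "tor handshake reply"
# A stub web server that rejects the garbage:
web_server = lambda data: "connection reset"

print(firewall_decision(SUSPICIOUS_CIPHER_LIST, tor_bridge))  # block
print(firewall_decision(SUSPICIOUS_CIPHER_LIST, web_server))  # pass
```

The point of the second stage is exactly the last branch: a suspicious-looking flow whose server does not confirm the guess is left alone, which is how the firewall trades probing effort for reduced collateral damage.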
You can do this without bothering with IP addresses and IP churn and DHCP; it's completely dynamic. All you need to do is look at traffic as it traverses the country border, inspect it, and then probe stuff. So it's a two-stage system: in the first stage you do deep packet inspection on a lot of traffic, and a small subset of the traffic is flagged as suspicious; then, to understand what it really is, active probing is used. I'm talking about Tor here simply because it's the most interesting example; there are a lot of other protocols that are actively probed. In fact, the system seems to be modular: they can easily write modules for additional protocols. At this point, what we did was a research project, and we started compiling datasets. The first dataset we ended up calling the Shadow dataset, and it's what you would do if you were to tackle such a problem: you try to get vantage points in China, in our case in two different ISPs. There is Unicom, which is a very large ISP in China, and there is CERNET, which is short for the China Education and Research Network. It turns out they filter slightly differently, which is quite interesting: not everyone is the same when it comes to packet filtering, and some networks filter more than others. We had these systems in two different networks, and then we repeatedly established Tor connections to bridges under our control. Bridges are basically Tor relays that are unpublished for the sake of censorship circumvention. We established three different types of connections: a vanilla Tor connection, which isn't particularly useful for censorship circumvention, and the protocols obfs2 and obfs3, which are designed for censorship circumvention. We did that for a couple of weeks, over and over again, approximately every 10 minutes, so we ended up with a lot of connection attempts.
We found that this wasn't as helpful as we expected, since we didn't end up with a lot of probing IP addresses. So we created a second dataset, which we call the Sybil dataset. This is basically a client in China that connected to 600 different ports on one of our machines; we used an iptables redirect to fool the firewall into thinking we were running 600 bridges on a single machine, and to the firewall it looks just the same. That way we ended up with a lot of different probing IP addresses. Finally, we had a third dataset, which we call the Log dataset. This is really just a web server whose log files go back several years, which is cool because the logs tell you, in some way, how active probing evolved over time. All right, the first thing we were interested in was where all these IP addresses come from. We looked at our three datasets, and all together we had a little more than 16,000 unique probing IP addresses. And they really are unique: there is very little repetition, and 95% of them were seen only once. We also did reverse DNS lookups and found strings like ADSL in there, so it looks like many of these IP addresses originate from ISP pools, and the WHOIS records suggested the same. So this is not a small set of IP addresses reused over and over; this is a large pool drawn from ISP address pools. And those are not users; the active probes are automated systems. This was really odd to us. There is some set intersection, though, and there is one IP address among all of them that shows up a lot: almost 50% of the probing activity used to come from this single IP address. This system seems to be somewhat different from the rest; it seems to be a dedicated scanning machine. I think it's not online today, so don't bother trying, but it used to be until recently, and it even had an open SSH port.
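The redirect trick behind the Sybil dataset can be sketched with a single iptables rule: redirect a whole range of external ports to the one port where the real Tor bridge listens, so each probed port looks like a separate bridge and costs the firewall a separate blocklist entry. The port numbers below are illustrative, not the ones used in the study.

```shell
# Make one bridge look like 600 bridges: redirect 600 incoming TCP ports
# to the single port the real Tor bridge listens on (here 9001, the common
# default ORPort; the external range 30000-30599 is an arbitrary example).
iptables -t nat -A PREROUTING -p tcp --dport 30000:30599 \
         -j REDIRECT --to-ports 9001
```

From the firewall's perspective every one of those 600 ports completes a genuine Tor handshake, so each one is recorded as a distinct bridge.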
It seems to be a dedicated system that is somewhat different from the rest. What this means for us is that I cannot just come up with a blocklist of active probes, hand it to all of you, and say: put this in your iptables rules and no prober will ever show up again. There are just too many of them, and our list isn't even exhaustive; there are more out there. Our assumption was that maybe they are hijacking all these IP addresses. Maybe they don't even own them; maybe they just borrow them for a short amount of time to probe and then give them up again. And it was really odd, because it wasn't possible for us to talk to these probes other than through the probing connection itself. Port scans didn't work because all the ports were filtered. We couldn't traceroute to them because traceroutes would time out a couple of hops before the destination. You couldn't talk to the probes; they showed up at your doorstep, but not the other way around. So there wasn't a lot of interaction possible, and we were left looking for patterns in all the data we gathered. We did that in a systematic way, going bottom to top through the TCP/IP layer model. We started looking for patterns in the IP layer and didn't find a lot of interesting stuff: you get a somewhat narrow TTL distribution, which isn't particularly surprising and can be explained by routing. But we found much more interesting stuff at the other layers. For example, a lot of these 16,000 addresses, not all of them, used the entire 16-bit source port range, which isn't something you would find in many modern operating systems, since they don't use the well-known port range for source ports. But the probes did. We also found odd patterns in the TCP timestamp value. I'm not going to discuss that since there's not enough time, but you can find the details in our research paper.
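The source-port observation above is easy to operationalize: a stock kernel draws client source ports from a configured ephemeral range, so source ports far below it hint at a custom stack. The range below is the common Linux default; it is a per-system setting (see /proc/sys/net/ipv4/ip_local_port_range), so treat it as an assumption.

```python
# Sketch: flag connections whose source port falls outside the ephemeral
# range a modern Linux kernel would use by default (32768-60999).

EPHEMERAL = range(32768, 61000)  # assumed default; varies per system

def odd_source_port(port):
    """True if a stock kernel would be unlikely to pick this source port."""
    return port not in EPHEMERAL

# Source ports like 80 or 443 on *outgoing* connections hint at a
# custom user-space TCP stack rather than a normal client:
print([p for p in (80, 443, 1024, 33000, 51234) if odd_source_port(p)])
```

By itself this is a weak signal, but combined with the timestamp and sequence-number oddities it supports the user-space-stack hypothesis.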
These things seem to suggest that this might not be a normal TCP stack in a kernel; perhaps it's a user-space TCP stack, maybe custom-built to scale their probing activity better. We're not completely sure about that. The most interesting pattern we found was in the initial sequence numbers. Quick reminder of how this works: the TCP initial sequence number is a 32-bit number in the header of TCP segments. It has the nice property that it protects you against off-path attackers, since an attacker has to guess your sequence number interval, which is supposed to be hard. The operating system achieves that by randomizing the initial sequence number for every TCP SYN segment. And this can be visualized: this diagram shows the distribution of a lot of initial sequence numbers over a couple of hours. Time is on the x-axis, and the y-axis is the value of the initial sequence number, ranging from 0 to about 4 billion. Every dot is a TCP SYN segment that I captured. I didn't just make this up: this is derived from a modern Linux kernel that I had send SYN segments to a destination machine; I extracted all the SYNs and plotted them here. It looks like a random pattern, just as you would expect. You would expect the exact same thing when you look at the active probes from China: if you take all their SYN segments and plot them, you should see this. But what we actually saw was this. And it's really weird: it's a skewed pattern that goes up like this, and if you connect the dots it becomes more obvious. It's a zigzag pattern: it goes up until it wraps back to zero, then it goes up again, and so on. It turns out the sequence numbers are perfectly correlated with time: they simply derive the sequence numbers from a timestamp.
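The zigzag pattern falls out directly if the ISN is a linear function of time reduced modulo 2^32: the value climbs steadily and wraps back toward zero. The scale factor below is invented; the point is the wraparound shape, not the exact slope the probes used.

```python
# Sketch of the observed pattern: initial sequence numbers derived from a
# timestamp, wrapping at 2**32 -- hence the zigzag when plotted over time.

SCALE = 250_000  # hypothetical ISN units per second; the real slope differs

def isn_from_time(t_seconds):
    return (t_seconds * SCALE) % 2**32

# Sampled over time the values climb linearly, then wrap back toward zero:
samples = [isn_from_time(t) for t in range(0, 40_000, 5_000)]
print(samples)
```

Because many probing machines shared this clock-derived state, ISNs observed across hundreds of different IP addresses fell on the same zigzag line, which is exactly the cross-address state leakage described next.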
That's what's apparently going on, which is really interesting, because remember, this is not one machine: these are the sequence numbers from hundreds of machines across a lot of different IP addresses. So this is basically state leakage. And we found more odd stuff in the TLS layer. The Tor protocol stack is: you have TCP, then TLS, and then the Tor protocol on top. On the TLS layer we looked at the ClientHello, and again it just looks weird; it's not something a normal application would send. For example, there is no randomly generated SNI, like Tor clients send to servers; it's just missing entirely. The cipher suite list seems to be unique as well; it's just not something a Tor client would send. To get an idea of how unique it is, we recorded the cipher suite lists arriving at a Tor guard relay. Obviously we did not record any IP addresses, so all we have is a frequency distribution of how popular each cipher suite list is. We ended up collecting more than 200,000 cipher suite lists in 24 hours, and the specific one the probes use is extremely unpopular: just 67 occurrences. We don't know where those came from, because we didn't record IP addresses; maybe they were from China, we don't know, but the point is that this is absolutely not common. This is not something a Tor client would send. And if you go up to the final layer, this is what probes do when they connect to a bridge, after establishing a TCP connection and completing a TLS handshake: they simply send a VERSIONS cell to our bridge. Our bridge responded according to the protocol specification with a VERSIONS cell and a NETINFO cell, which are part of the handshake. And at this point the probe closes the connection. It doesn't care anymore; it doesn't bother creating a circuit, since it presumably already knows what it needs to know: that this is a Tor bridge. It doesn't need to do anything more.
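The guard-relay measurement above is just frequency counting without client identifiers. A minimal sketch, with placeholder suite names and invented counts chosen only to mirror the shape of the result (one dominant browser-like list, a vanishingly rare probe list):

```python
# Sketch of the cipher-suite popularity measurement: count how often each
# cipher suite list shows up, keeping no client identifiers at all.
from collections import Counter

observed = (["common-firefox-suites"] * 180_000     # placeholder fingerprints,
            + ["common-chrome-suites"] * 25_000     # invented counts
            + ["probe-suites"] * 67)                # the probes' list: rare

freq = Counter(observed)
total = sum(freq.values())
for suite, n in freq.most_common():
    print(f"{suite}: {n} ({100 * n / total:.3f}%)")
```

Storing only the Counter, never addresses, is what makes this measurement safe to run on a live guard relay while still showing that the probes' fingerprint is an outlier.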
Again, just looking at how it does this, this is not the reference implementation of a Tor client; it looks like something handcrafted. And like I mentioned before, there is a lot of state leakage: these probes leak state, and we can observe it across IP addresses. It's a really interesting question to think about how the system is designed, and we have a bunch of sections on that in our research paper. Unfortunately, we don't have an answer; we have some hypotheses, some of them more likely than others, but we don't really know. So if you have some ideas, please get in touch with us, because we're really dying to know how this thing works. One hypothesis is that it's a proxy network: a set of geographically distributed proxies all over the country, and the firewall somehow tunnels its traffic over these proxies and uses them to scan a server. Personally, I think that might be a little too much engineering effort, and it could be done more easily. But our data is somewhat contradictory, and we cannot really rule out any hypothesis. A second one is that maybe they have a server sitting in a data center at an ISP, connected to a switch port, and whenever they want to borrow an IP address, they somehow update the access control lists on the switch, inject packets, and basically hijack the IP address for a couple of minutes until they're done, and then give it up again. But this is just a theory; unfortunately, we still don't really know. So I've talked a lot about the structure of this system. Another question is: how well does it even work? This is what our Shadow dataset can answer. Remember, we established connections over and over again, and every dot in this diagram represents one connection attempt: it either fails or it succeeds.
The successful attempts are at the top and the unsuccessful ones at the bottom, and we have both for CERNET and for Unicom, our two ISPs. One really cool thing I want to point out: look at the top lines of both. Sometimes a connection succeeds once, and 25 hours later it succeeds again, once. And this repeats. So there's a recurring pattern of being able to establish a Tor connection every 25 hours, which is really weird. Again, we don't know why. One theory is that perhaps they flush their blocklists every 25 hours and then rebuild them, which maybe takes a couple of minutes, and in that time span people can do whatever they want. That's one theory; after all, it looks like all the quality assurance for the Great Firewall is done by academics anyway, so you might as well trust us and not the operators of the Great Firewall on that. It's not clear. So that's the effectiveness. And how fast are these probes? In 2012, it looks like they were batch-processed: it looks like there was a cron job running on a system, and every 15 minutes it would process all the queued probing tasks. That seemed to work fairly well; this is a visualization we created back then. Apparently the system was improved not too long ago, and they turned it into a real-time system, which is really interesting. This was derived from our Sybil dataset. On the x-axis you can see the ports of our dataset, from 0 to 600, and for each port you can see how long it took for a probe to show up; the delay is on the y-axis. You can see that most points are very much at the bottom, so probes show up really quickly. In fact, the median arrival delay is just 500 milliseconds: for most probes, only half a second passes between the initial connection from a user and the follow-up active probing connection. So these days it seems to be fairly efficient.
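The delay measurement above boils down to computing a median over per-port arrival delays, which makes it robust against the batch-processing outliers mentioned next. The numbers below are invented; the talk's observed median was about 500 ms.

```python
# Sketch of the arrival-delay statistic from the Sybil dataset: for each
# probed port, the delay between our own connection and the first probe.
# These delays are fabricated to illustrate why the median is the right
# summary: one slow batch-processed outlier barely moves it.
from statistics import median

delays_ms = [430, 510, 470, 495, 520, 15_000, 480, 505]
print(f"median probe delay: {median(delays_ms)} ms")
```

A mean over the same data would be dragged above 2 seconds by the single outlier; the median stays at 500 ms, matching the "half a second" headline figure.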
There are a bunch of outliers in there we cannot fully explain: you see all those peaks, and then they go down in a linear fashion. We're not completely sure; I think a lot of these things represent implementation artifacts of the Great Firewall. But then again, in a way, we're doing the quality assurance for the Great Firewall, so I'm sure someone is going to learn something from this diagram. All right, so much for the design of the system. I've only talked about Tor so far, simply because it's the most interesting example, since they put a lot of thought into it. But there are a lot of other protocols that are probed by the firewall, and this is an incomplete list. It looks like it started in 2011, when somebody wrote and published an article about how he received weird SSH packets from China with a seemingly random payload. Back then it wasn't really clear what it was or what it was used for, since apparently it was just probing activity; there was no blocking going on. So probing doesn't automatically equal blocking: there was a lot of probing going on and never a blocking decision. Maybe there was just experimentation going on; it's not completely clear. There was probing for OpenVPN for a while, and for SoftEther, which is part of the VPN Gate system. Tor I have already talked about. Also for Google's AppSpot, since there used to be a small and really clever circumvention tool named GoAgent, and I guess they perhaps came up with this probing activity to detect those GoAgent instances. And maybe there is more; I'm sure there is more out there, and maybe someone even has an idea, so please get in touch with us. We also looked closely at how the Great Firewall deals with obfs2 and obfs3 probing, and again, it doesn't take long to find oddities in there that you wouldn't expect in a normal connection.
Again, it looks like they're just not using reference implementations, which I find surprising, since all the software they're trying to probe, from Tor to obfs2 and obfs3 to OpenVPN, is free software. You can just download it from a website, throw it into your probing system, and let it run. But still, for some reason, all these things seem to be handcrafted. Maybe to be more efficient, or maybe nobody wanted to bother figuring out how to properly use the original software; it's not clear. The ironic thing is that this handcrafting makes it possible for us to fingerprint the active probes, after they fingerprinted all the traffic of circumvention protocols. Because just like circumvention protocols leak state, the active probing system does as well. What we find are things like the padding used in obfs3 looking somewhat different: it's according to the specification, but instead of splitting the handshake across two TCP segments the way the reference implementation does, they just send one. We even found cases of duplicate payload, which you really shouldn't find in the data, because the payload is supposed to be uniformly random, and the odds of seeing the same payload twice, even from a single IP address, let alone across two, should be vanishingly low. In that case, two different IP addresses showed up with an identical payload. Again, it looks like they're leaking state all over the place. And remember that we also have this Log dataset, which is basically the logs of a web server; it allowed us to look back into the past and see how probing evolved over time. We ended up creating this diagram. We could identify a lot of probes in the data, we're fairly certain it should be free of false positives, and it dates back to 2013. The magnitude for each protocol refers to how often we saw it on a given day.
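The duplicate-payload check is simple to express: if handshake payloads are supposed to be uniformly random, the same byte string appearing from two different source addresses is essentially impossible by chance, so any collision flags shared probe state. The payload bytes below are fabricated for illustration.

```python
# Sketch: flag handshake payloads seen from more than one source address.
# Uniformly random payloads should never collide, so a collision across
# IP addresses is strong evidence of shared (leaked) probe state.
from collections import defaultdict

def find_shared_payloads(observations):
    """observations: list of (src_ip, payload_bytes).
    Returns payloads seen from more than one source address."""
    sources = defaultdict(set)
    for ip, payload in observations:
        sources[payload].add(ip)
    return {p: ips for p, ips in sources.items() if len(ips) > 1}

obs = [
    ("198.51.100.7", b"\x13\x37\xab\xcd"),   # fabricated payloads
    ("203.0.113.9",  b"\x13\x37\xab\xcd"),   # identical bytes, different IP
    ("198.51.100.8", b"\x42\x42\x42\x42"),
]
print(find_shared_payloads(obs))
```

For a 20-byte random payload the collision probability across even millions of observations is astronomically small, so a single hit like this is already conclusive.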
This is really cool since it gives you some idea of how the firewall evolved over time. Like I said, it's not a complete diagram, but we can learn some things from it. For example, up to the end of 2013 there was apparently a lot of probing activity, and then for a couple of months it almost stopped. There's some trickle still going on, but we don't know what was happening in that time span; there was just not that much probing. We can also see how new protocols were introduced and maybe even tested, so it's really cool to see how all these things developed over time. But again, take it with a grain of salt, since it's just not complete; I'm sure there is more to this than what we captured in this diagram. On our project website, we have a bunch of instructions you can run on your web servers or other boxes to find your own probes. This can be as simple as grepping for specific POST and GET requests in your web server logs, for SoftEther, for example, or for the AppSpot probes. You could also just grep for this peculiar IP address I mentioned earlier. It looks like there is no activity from it these days, but that's probably a fairly foolproof way to find some interesting stuff, and it doesn't do just probing for circumvention protocols; I've even seen that thing in my web server logs for some other reason. So maybe somebody can find some additional information about it. The really interesting thing about the active probing system is that it's active. A lot of censorship systems tend to be entirely, or mostly, passive: they just sit there, wait for suspicious traffic, and then they terminate your connection or drop your packets. They do whatever it takes to make the connection go away, but they're more or less passive; you cannot interact with them in any way.
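The "find your own probes" idea is essentially pattern matching over log lines. A minimal sketch, where the request patterns are illustrative stand-ins for the real signatures published on the project website:

```python
# Sketch: scan web-server log lines for request patterns associated with
# known probe types. The patterns below are hypothetical placeholders --
# the actual grep recipes are on the project website.

PROBE_PATTERNS = {
    "softether": "POST /vpnsvc/connect.cgi",   # assumed SoftEther-style probe
    "appspot":   "GET /twitter.com",           # assumed AppSpot-style probe
}

def classify_line(line):
    for name, pattern in PROBE_PATTERNS.items():
        if pattern in line:
            return name
    return None

log = [
    '1.2.3.4 - - "POST /vpnsvc/connect.cgi HTTP/1.1" 404',
    '5.6.7.8 - - "GET /index.html HTTP/1.1" 200',
]
print([classify_line(l) for l in log])
```

In practice a one-line grep over access logs does the same job; the point is that probe traffic leaves distinctive, searchable traces in ordinary server logs.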
This is entirely different for the active probing system, because as the name already says, it's an active system that connects to you, that talks to you. It has to talk to you, because otherwise it cannot block you. And its active nature makes it possible to interact with it in unexpected ways. We don't have a complete list, but there are a bunch of things you can do to make the life of the active probing system a little harder. One question is: how large is their blocklist, really? In our Sybil dataset we redirected 600 ports to a single Tor port, which means we basically put 600 new entries in their blocklist. But that's just 600. Every IP address has more than 65,000 ports, and you can use them all, so every single IP address can add more than 65,000 new entries to their blocklist. That's what you can do with one address, and then you can continue to scale horizontally: a single /24 network can add roughly 16 million entries. So imagine if all of you would team up and do something like that; I guess it would be pretty easy to hit their limits quickly. You can do something very similar with file descriptors. File descriptor limits are imposed by the operating system, and it's not clear what the limit is for this system, but by establishing a lot of connections and keeping them open, you can maybe get closer to that limit. You just attract a lot of probes and then don't let them go away: you let them establish a TCP connection but never send data, so you keep them waiting. At some point they will probably time out, so you'd probably have to scale this horizontally too, but it's not clear what happens if the system runs out of file descriptors. Will it just stop probing new systems, or what's going to happen? It's not clear.
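The blocklist-flooding arithmetic above is worth writing out, since it assumes the firewall blocks per IP:port pair (which is what the Sybil experiment suggests):

```python
# Back-of-the-envelope for blocklist flooding: every IP:port pair the
# firewall confirms as a "bridge" costs it one blocklist entry.
ports_per_host = 65_535    # usable TCP ports per address
hosts_in_slash24 = 256     # addresses in one /24 network

entries_one_host = ports_per_host
entries_slash24 = ports_per_host * hosts_in_slash24

print(entries_one_host)    # 65535
print(entries_slash24)     # 16776960 -- roughly the 16 million from the talk
```

If the firewall instead collapsed entries to whole IP addresses it would cap this at 256 entries per /24, but that would also block every innocent service on a probed host, which is exactly the collateral damage it tries to avoid.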
So maybe this is a way to prevent additional circumvention servers from being blocked. My personal favorite is probably this one, and I have to give credit to the authors of the VPN Gate paper, which is really great. VPN Gate is a circumvention system based on VPN servers: the idea is you have a lot of VPN servers, I think several hundred across the globe, and you try to get them to censored users. So they compile a list of IP addresses, give it out to users, and the users then select a VPN server that works for them. If you were in charge of the Great Firewall, the easiest thing to do would be to have a cron job that fetches this list every day, extracts all the IP addresses, and adds them to a blocklist. Pretty straightforward. But the thing is, this gives control to the people who run VPN Gate, or to whoever distributes those IP address lists: they can basically control the blocklist of the Great Firewall, which is insane if you think about it. And it's not just about the IP addresses of VPN servers; they're just IP addresses. What happens if you put Windows Update in there, and all of China is no longer able to update their Windows, because the Great Firewall is just blindly pulling IP addresses from a list? The same is true for the DNS root servers, you'd basically break DNS for a whole country that way, and also for Google infrastructure. They discuss this in great detail in the VPN Gate paper. They have a bunch of sections in there about this cat-and-mouse game, which was super fun to read, and it played out over a matter of days: on day one we did this, then the Great Firewall did that, and we reacted by doing this.
And I think it only took two or three days until the firewall operators noticed that it was possible to inject arbitrary IP addresses into their block lists, and at that point, they apparently started to verify addresses. It's not completely clear how; maybe they just check who the address belongs to, or try to connect to it. But the point is to make sure they don't just swallow whatever you feed to the firewall, right? So that's personally my favorite way to mess with the firewall. And so far I've only talked about how this system works, but not much about how it can be circumvented. And it turns out that not all is lost, and we still have a lot of ways to get connectivity in China. One way is to just look at how the Great Firewall deals with TCP segments. When I talk about DPI, you probably just think about pattern matching, right? Finding byte sequences in TCP segments. But there is actually much more to it than that, since you need to reassemble the TCP stream, which is not a trivial thing to do if you are dealing with network packets that were crafted to make this super hard. And that's something you can do, right? Deliberately make it hard for them to reassemble. And we actually exploited that a while back, since at that point, the Great Firewall didn't reassemble streams, maybe for performance reasons; it's not clear why. After all, it's just simpler to scan packet by packet without reassembling the stream. And we wrote a tool that manipulates the TCP window size on a server, which can be used to instruct the client to break its signature into two pieces. And that was enough to circumvent the DPI engine. So you basically just instruct the client to break the signature into two pieces, and the firewall is no longer able to identify it. And this worked for approximately a year until it was fixed. It was never meant to be sustainable, but it was a nice hack anyway.
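The window trick can be approximated in ordinary socket code: shrinking a server's receive buffer before listen() shrinks the window it advertises in the handshake, which pressures the client into splitting its first payload, and any DPI-matched signature in it, across several segments. The actual tool manipulated handshake packets directly, so take this socket-option variant as a sketch only; the buffer size and port are arbitrary.

```python
import socket

def small_window_listener(port, rcvbuf=2048):
    """Listening socket with a deliberately tiny receive buffer.
    SO_RCVBUF must be set *before* listen() so the small window is
    already advertised during the TCP handshake, nudging the client
    to split its first write into multiple segments."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    srv.bind(("127.0.0.1", port))
    srv.listen(16)
    return srv
```

Note that the kernel treats SO_RCVBUF as a hint (Linux, for example, doubles the requested value), so the effective window is small but not exactly 2048 bytes.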
And there is even a research paper where they looked at all these nifty little hacks you can do, how you can format your TCP segments in a way that makes it really, really hard for the firewall to reassemble them, since there are a lot of ambiguities when you're trying to do that. And this is really neat, since some of these things are really difficult for the firewall to fix, but unfortunately they are also really tricky for us to exploit, right? It really boils down to me telling people: hey, why don't you just run this kernel module I wrote over the weekend? It manipulates your operating system's network stack in a weird way, and the Great Firewall is no longer able to scan the traffic. But these things are just a nightmare to deploy, and as a result, we haven't really seen a lot of them. It gets a little bit easier with nice APIs like the kernel's netfilter API, but it's still just not very practical to deploy. So what turned out to be much more successful was the Tor Project's pluggable-transport idea. This started a couple of years ago, when the Tor Project noticed that it would actually be much easier to fight this arms race if we decoupled the anonymity part from the circumvention part in the Tor client. So we take the circumvention part and put it into a separate program, basically. Instead of the Tor server and client talking directly to each other, what you have is an additional proxy sitting right in front of each of the two Tor instances; these proxies scramble the traffic in some weird way between them, but the Tor clients basically just talk to an additional SOCKS interface. And since it's SOCKS, it's really easy to use for a lot of different applications. So the cool thing is, this is by no means specific to Tor, and in fact, a lot of different systems started using it. There is basically an open specification for it at this point.
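To make the decoupling concrete, this is roughly what it looks like on the Tor side: the client just launches the transport binary and hands it traffic over SOCKS. A minimal, illustrative torrc fragment might look like this; the binary path, bridge address, fingerprint, and cert value are all placeholders, not real bridge data:

```
# Reach the Tor network through bridges only.
UseBridges 1
# Launch the pluggable-transport client; Tor talks to it over a
# local SOCKS interface it spawns.
ClientTransportPlugin obfs4 exec /usr/local/bin/obfs4proxy
# A bridge line as handed out out-of-band (placeholder values).
Bridge obfs4 192.0.2.10:443 <FINGERPRINT> cert=<CERT> iat-mode=0
```

Because the transport speaks SOCKS on one side, swapping obfs4 for ScrambleSuit, meek, or FTE is just a matter of changing these lines, which is exactly the flexibility described above.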
And it's particularly useful for a lot of VPN providers that are trying to give access to people in censoring countries. So they just put a pluggable-transport server, which we usually call obfsproxy, in front of it, and that often solves the problem. And it's a really flexible system, since you can modify payload just like you can modify flow information, right? You can deal with things like packet lengths. And there are also APIs for several different languages. It started out in C; then we noticed it's actually a little bit cumbersome to write those modules in C, so it was re-implemented in Python, and now there is also one in Go. And on the right, you can see a visualization of the payload of vanilla Tor versus obfs2 and obfs3, where every pixel represents the value of a byte. And the point of this diagram is: if you look at the top left, you see that vanilla Tor doesn't look very random, right? What you see there is basically the TLS handshake, since it just doesn't use the entire byte space in its handshake. And this is what some circumvention protocols are trying to fix. And at this point, we have at least four protocols that work in China. ScrambleSuit and obfs4 were specifically designed to be resistant to active probing, since they rely on a shared secret that is handed out out of band, basically. And if you cannot prove knowledge of this shared secret in your network packets, a ScrambleSuit or obfs4 server just won't talk to you, right? And there is also meek, which is really exciting because it makes use of the collateral damage problem I mentioned in the beginning. What it does is tunnel all the traffic over content delivery networks; the technique is also known as domain fronting. And this is really cool, because the idea is that if you are a censor, you are forced to make a decision: are you going to block this content delivery network and everything it hosts?
Or are you going to let all the users pass and use it for circumvention? And ideally the answer to that is: no, I'm not going to block it, because that would be insane; there are just so many websites hosted on CDNs these days that it should just be too much for a censor to block. So that's the idea of meek. And FTE stands for format-transforming encryption, and it's a neat hack that basically formats arbitrary byte streams according to a bunch of regular expressions you provide. And the cool thing is, all these systems aren't just science fiction. They're developed, they're tested, and they're deployed. So when you download Tor Browser today and you start it, you actually get a window like that, and you can select whatever you prefer among these circumvention protocols. And a lot more of them are in the making right now, like a WebRTC-based transport, which will hopefully fix a bunch of nasty problems and which looks really promising at this point. And that brings me to the end of my talk. We have a project website where we have our research paper and the dataset, and just like it should be with scientific research, it's free and open, the code and the data. And if you want to get in touch with us, those are all our email addresses. Thanks a lot. Okay, perfect. Thanks so far, Philip. We have about 13 minutes left for questions. So if you have a question for Philip, please line up at one of the six microphones in this room, so we can get you on tape and everybody on the stream can also hear your question. All right, then mic one, please go ahead. Thank you for the talk. I've got a question: are there any similar things known about the North Korean internet? Do they censor in a similar way, or censor at all? Yeah. The question is whether there are similar systems in other countries, and I'm not aware of any, right?
And I wouldn't be surprised if that happens at some point, since that kind of stuff is easy for Western countries to put into DPI boxes and then sell to other countries, right? So I'm not aware of anything now, but I wouldn't be surprised if it happens soon, unfortunately. Okay, a quick thing before we continue with the Q&A: if you really do need to leave before the Q&A has ended, please do so quietly, so we can pay some respect to the speaker and finish the Q&A. So please, mic two, go ahead. You were wondering about the probes and where they were coming from. Many computers in China still run Windows XP, and the virus scanners are from state companies. So what about a botnet? Have you thought about that? Right, that's a good question. We thought about that, and we cannot rule it out, right? Our thought was that it would be hard to keep that secret. But then again, maybe nobody ever noticed. Maybe. Mic three, please. Did you do any analysis with regard to IPv6? Because it adds complexity and new challenges. Right, unfortunately we didn't. I hear that rumor a lot, that IPv6 is censorship-resistant simply because DPI boxes don't support it very well. I don't know if it's true, it sounds pretty cool, but unfortunately I don't have any info on that, no. All right, do we have a question from the signal angel and the internet? All right. Question from Alex, CCP2. Has any engineer from the inside leaked anything about the great Chinese firewall? I'm not aware of anything, but that's one of the reasons our email addresses are on the slide, I guess. All right, then just another question from ISE. What reason is there to think that China only operates probes from inside China? Oh, I see. So I guess it depends on the infrastructure, too.
Maybe it's just easier technically, so there is a technical reason, and I'm sure there is also a political reason: you can do that in your own country, but if you start messing with that stuff in other countries, that has a huge political impact, and I guess people tend to be smart enough not to ignore that. So yeah, I think those are the two answers: there are technical and political reasons not to do that. But there is actually research that looks at the collateral traffic of DNS injection in China, right? DNS poisoning. So traffic that merely traverses China is also subject to DNS poisoning, even though the sender and the receiver are both outside of China, but that seems to be accidental, as far as I know. Okay, then another question from the room. Mic four, please. I was wondering if they ever block the source instead of blocking the destination. Oh, I see. Not by IP address, I think. So the interesting thing is they block, I think, the SYN/ACK segment of the TCP connection, which is kind of similar to that, and which we find surprising, right? Usually, when you want to block a TCP connection, you would just drop the SYN segment, since it's the first part of the handshake. But that's not the case here: when you're trying to connect to a blocked server, the SYN/ACK from the server is dropped, and we don't actually know why. People had a bunch of different theories, ranging from access control lists on switches to maybe trying to fool people into thinking that their traceroutes succeed. We're not completely sure, but in that sense, it's similar, yeah. All right, then another question from our signal angel, please. Janix wants to know: has anyone tried to use the probes against them? Like, prepare a honeypot, and when a probe arrives, try to fingerprint it, analyze it, exploit it?
Right, I'm not aware of that, but again, I think there are more ways for this system to fail. Like, what if you could start adding all these probe-triggering signatures to really popular protocols, right? All of a sudden, the system would have to scan all those targets. So I'm not aware of anyone exploiting probes, but I guess there are more ways to trick them into doing things they're not supposed to do. Okay, I think mic five has been waiting for a while. I just looked through my logs and I found some of these requests, for example, for the VPN service, et cetera. Do you need some more information for a dataset? That would be amazing, if you could drop us an email. Awesome, thanks a lot. All right, I think we have plenty of questions from the internet, so we'll go back to that. Tuxel is curious: do you have any idea about the algorithm behind the initial sequence numbers that the GFW probes use? We don't. What seems to be clear is that they derive it from time, but I'm not aware of the exact algorithm they use. Okay, all right, then back to mic one, please. Do you have a clue about the technology stack they are using? I heard something about Western companies being involved in the Great Firewall. I think there is actually enough know-how there to not have to rely on that. And yeah, the technology is a good question, since it looks like it might be a TCP stack in user space, but we're not sure. So if someone is an expert on that, maybe they could just look at the packets. We have the pcaps online, and maybe they could infer what kind of user-space TCP stack is used. Okay, all right, then mic three, please. Hi, just some impulsive thoughts after hearing about your research: the sequence numbers, and also the IP address range. I was wondering if you've thought about maybe a single machine, or a few machines, that are actually acting on behalf of multiple addresses. I've played with that in the past in a single local area network.
Basically, I was able to respond to packets with different identities, with different IP addresses. So maybe, if they could do that on a higher level, they could steal an identity and then send packets from different IP addresses. I'm asking whether you've thought about that, or whether this could be possible. I'm not sure if I fully understand, but it sounds like something to look into, yeah. We have seen some patterns in the TCP timestamps that actually suggest there is only a small handful of physical systems, like maybe 10 of them. I've played with this a bit in the past: I had a server and I was able, using a Python script that monitors the traffic and responds to requests, to have a single machine respond on behalf of 100 IP addresses. I would dynamically decide whether I'm going to respond or not. So maybe they could be doing something like that on a higher level, deciding whether to respond on behalf of an IP or not. Just an idea, right? Yeah, thanks for the hint. Okay, because there are plenty of people, we're going to stay for another question at mic three. Can you give an estimate of which mechanism, DNS poisoning versus deep packet inspection, is more effective overall for censorship? Right, so deep packet inspection is basically just finding patterns inside packets, right? So you basically use DPI to do DNS poisoning. So maybe you meant reset injection as opposed to DNS poisoning, I'm not sure. No, I mean in comparison: whether they just block the IPs and prevent you from accessing the servers, or whether they just say, okay, you can't access facebook.com because we don't like the website. Right, so it looks like countries are generally moving away from IP addresses, since they're just a mess to deal with, right? You want to be independent of endpoints, and not deal with things like IP address churn, content delivery networks, that kind of stuff. So it looks like there is a trend towards being completely dynamic and only looking at the bytes flying by instead of at endpoints.
Because for the DNS poisoning, that sounds like it could just be solved with DNSSEC: you authenticate the responses and pick the one that is authenticated? Well, sure, I mean there are ways to fix that. You could also just wait until you get both responses and pick the second one. Yeah, I guess, yeah. Thanks. Okay, for the last two remaining minutes, please be quick and precise with your questions. Mic two, please. There are a lot of papers about the GFW, and you talked about the fingerprints the GFW is using. Does the research you are doing on those fingerprints, and the publication of that research, lead to a change in the fingerprints of the Great Firewall itself? Right, so that's a good question. They change anyway over time, right? People start using different systems, and those fingerprints change, but we're generally not talking about days or weeks here. Those changing fingerprints, it feels like it's a matter of months, and sometimes almost years, right? I guess there are just not a lot of people working at the Great Firewall who have time to adapt to new technologies all the time. So I guess the answer is: they do change, but I think talking about it in public is not the major driving force for the fingerprints to change. And are they adapting them to react to the research you're doing in the Western world, or are they just changing them for other reasons? I'm sure there is some adaptation going on. One important thing is: nobody cares if you have your own circumvention tool that is used by five people. What matters to governments is circumvention at scale, right? No matter how good or how bad your tool is, the moment thousands of people start using it, they're going to look into it. So that's what really triggers them to start working on it. Okay, so we are unfortunately out of time. Sorry for the remaining people who still had questions. Philip is probably going to stick around, and you've seen his contact details.
Yeah, huge applause. Thanks, Philip.