When you leave this tent, I'd like you to exit through any door other than the one you came in through, right there. People come in through that door, so please leave through any of the other doors. Thank you very much. Also, as you know, the tent is one hour behind; all of the other speaker rooms are operating on schedule. If you want to see the next talk in Apollo, I suggest you start queuing half an hour before the talk starts. If you're here in this room, odds are you will be able to stay for the next talk; we haven't had to empty this room all day today. That's all I have. Also, please don't take any photos during a presentation at DEF CON without the consent of the speaker in advance. That's a new rule that's come in. What? Oh, and no photos of the crowd or anybody else unless you have the consent of the people you're photographing. That's all. Are you ready? Let's give a hand for Christian.

Thank you. Thank you. I assume everybody here knows and loves encryption, but I think steganography is much cooler, because in that case the adversary not only doesn't know what you're sending, he doesn't even know that you're sending a message at all. So in some sense it's much better. In steganography you always need some kind of cover message: the message the adversary sees you sending on the surface. What I'm going to talk about specifically today is using natural language as that cover. More common covers are images and sound, but we're going to use natural language here.

Okay. So what have people done so far for natural language steganography? Peter Wayner, and later Chapman and Davida, worked on this. Wayner essentially wrote context-free grammars by hand, and these context-free grammars would decode to natural language sentences, entire texts. Depending on which productions you chose in the grammar, it would encode a different message. The problem with that is obviously that you have to construct these large context-free grammars, and both the sender and the receiver have to have them. One refinement was to automatically extract syntactic templates from text: essentially take the collected works of Shakespeare and produce a grammar that generates text that looks like it came from Shakespeare. The problem is that the result was semantically terrible. Somewhat syntactically correct, but semantically terrible. Chapman and Davida improved on that by doing synonym substitution on existing text: you take some existing text and just substitute some of the words with synonyms — words with the same meaning — and thereby you get a semantically, syntactically, and rhetorically correct text. Except that you have to come up with good synonyms, and you have to come up with this good original text that you're going to substitute on.

So there are lots of problems with these approaches. In particular, with the synonym substitution approach you have words that are synonyms in some contexts but not in others. For example, I can say "I eat a sandwich" or "I devour a sandwich", but I cannot say "I devour" — devour requires a direct object. So a synonym is not always a synonym. And of course, if you need this original text that you're doing the substitution on, and you use the same text twice, the adversary will say: wait a second, I just saw the same text with minimal variations, and you're exposed. And if the adversary can find the original text, he can simply diff it against your cover text and discover that something is going on.
So these are the problems with the approaches that existed prior to this work. Where do these problems really come from? Well, first of all, automatic generation of syntactically, semantically, and rhetorically correct text is very difficult on its own. If you've ever tried to write a paper, it's hard for you; now imagine how hard it is for a computer, especially with all the ambiguities in natural language. And what all of these previous approaches have tried to do is mimic correct text. As in all steganographic approaches, you have some kind of cover object that the adversary expects, and if you deviate from that expectation, you're giving the attacker an attack vector. So if you're trying to mimic correct text, you'd better produce really correct text, otherwise the adversary can possibly spot you.

So what we really want is something where incorrect text is okay — something where the adversary, when he gets the text, doesn't expect that this text makes perfect sense or is syntactically perfectly correct. Ideally we can also solve the other problem: if the adversary gets hold of the original text or some database, he shouldn't be able to say, oh, this other text should not exist, it's just a minor variation — it should plausibly exist alongside whatever you used to generate your cover from. And, as I said, there should be errors in there that we can exploit for hiding the secret message.

Well, I'm sure you've already read the title, so you know where I'm heading. Here's an example from Babelfish. I took a page from the website of a German Linux event. In German it reads "Keine Sorge, ...". If you translate it to English by hand, you get something like: don't worry, they're all tame and also readily answer questions regarding the topic Linux, and are glad to give a small glimpse into the world of open source. Now, if you feed this into Babelfish — I assume most of you have used these translation services at some point or other, say on Heise or somewhere — you get: "a concern it are all not handsome and also readily questions approximately around the topic Linux and give gladly a small idea into the world of the open source." Okay, so first of all, the "not" is in the wrong place: it says "a concern, they are all not ..." where it should be "no concern, they are all tame". The German word for "tame" it didn't translate at all, because it wasn't in the dictionary. It adds an article, "the", in front of "open source", and I don't know why it does this strange capitalization. But you can still largely make sense of it: you might get some idea of what this event was about. So it's still useful for humans, and it's still perfectly plausible that somebody sends you this — you don't speak German, so he sends you a rough translation.

So let's see, how does machine translation work? Well, first of all, there are lots of low-quality translations out there, and there's never a single perfect translation anyway, right? If you translate from one language to another, there are multiple ways to express it in the target language. So if there is variation in the translation, that's okay — you get variation even between human translations, and even more so with machine translation. Current machine translation systems are statistical: they have a language model and a translation model which maps words and sentences, and some idea of syntactic and semantic rules, but mostly they do pattern matching, and they ignore almost all of the context. They look at one sentence at a time, at least all of the ones that we found.
Which can have some pretty bad results. Say somebody talks about pipelines and seals, and suddenly sea lions show up in the middle of the translation, because the translator picked the wrong dictionary entry for "seals". So there are plenty of errors in machine translation. And one other thing I should say: even if these errors weren't there, even if machine translation were perfect, there is, as I said before, always some slack in natural language, so we're not just exploiting an error in current software.

So how does the basic protocol work? We assume that Alice wants to send a message to Bob, and, as usual, they have some shared secret — in our case, which specific translation engines they're going to use and the other settings that define the configuration of the system. Alice first has to select some source text. In our case it can come from a public news source, so you could just take text from the BBC, the Bible, whatever you want, or write some of your own. It doesn't have to be secret for the system to work in general, but keeping it secret can have advantages, as you'll see towards the end of the talk. Given this cover source text, the translator configuration, and the secret message, Alice gives all of this to our tool, which, by the way, is available both as a PHP demo and as source code at the URL at the bottom of the slides. Don't all go there at once and try the demo: the server is a really, really slow machine, it's processing PHP, and you'll just kill it. It works fine if you don't all try it at once. Anyway, Alice gives this to the LIT system and, as we would expect, gets back a cover translation. She sends that cover translation, together with either the cover source text or a reference to it, to Bob, and Bob just feeds his shared secret, the cover source text, and the cover translation into his copy of the LIT system and gets back the secret message. So far, so easy.

So how does the encoding really work? First we take our cover source text and run it through several commercial and custom translation engines. In our case, since we weren't really up to the task of writing our own machine translation software, we go online: we query Babelfish, we query Google, we query Systran and a couple of other engines that are out there, and get back translations. After we get these translations, we actually want more than just two or three variations, because then the bit rate would be very low. So what we do is add additional variations to these sentences, making slight modifications that are plausible as machine translation errors — possibly even fixing translation errors — and we get even more sentences out of that. Whenever we add these variations, we compute a probability that the result is still a valid translation. And at the end, once we have lots of translations for a given sentence together with their probabilities, we build a Huffman tree, and then we walk through this Huffman tree based on the secret message we're trying to encode.

A picture says more than I can speak into the microphone. Again: the cover source text goes into the commercial translators — which ones, again, is selected by the configuration, and you can add new ones to the tool if you're so inclined — and then some post-processing gives us even more sentences.
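To make that concrete, here is a rough Python sketch of the encoding core (the real LIT tool is written in PHP; this is not its actual code). The engine wrappers in `engines`, the `post_passes` callables, and the probability handling are hypothetical placeholders — but the mechanism is the one just described: collect candidate translations per sentence, build a Huffman code over them, and let the next secret bits pick which candidate to emit.

```python
import heapq


def gather_candidates(sentence, engines, post_passes):
    """Collect candidate translations of one source sentence.

    `engines` is a list of callables wrapping the online translators
    (Babelfish, Google, Systran, ...); `post_passes` is a list of
    callables that take (candidate, probability) pairs and return extra
    plausible variations (article swaps, preposition swaps, ...).
    Both are made-up placeholder interfaces for this sketch.
    """
    candidates = {}  # translation -> estimated probability of being a valid translation
    for engine in engines:
        translation = engine(sentence)
        candidates[translation] = max(candidates.get(translation, 0.0), 1.0)
    for post_pass in post_passes:
        for variant, prob in post_pass(list(candidates.items())):
            candidates[variant] = max(candidates.get(variant, 0.0), prob)
    total = sum(candidates.values())
    return {text: prob / total for text, prob in candidates.items()}


def huffman_code(candidates):
    """Build a Huffman code over the candidates (translation -> bit string).

    The construction is deterministic (note the sorting), because the
    sender and the receiver must build exactly the same tree.
    """
    if len(candidates) == 1:
        return {next(iter(candidates)): ""}  # a single candidate carries no information
    heap = [(prob, i, [text]) for i, (text, prob) in enumerate(sorted(candidates.items()))]
    heapq.heapify(heap)
    codes = {text: "" for text in candidates}
    next_id = len(heap)
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)
        p1, _, group1 = heapq.heappop(heap)
        for text in group0:
            codes[text] = "0" + codes[text]
        for text in group1:
            codes[text] = "1" + codes[text]
        next_id += 1
        heapq.heappush(heap, (p0 + p1, next_id, group0 + group1))
    return codes


def encode_sentence(sentence, secret_bits, engines, post_passes):
    """Emit the candidate whose Huffman code word is the next prefix of
    the secret bit string; return it plus the bits still left to encode.
    Assumes `secret_bits` was already padded (e.g. with random bits) so
    it never runs short, as described in the talk."""
    codes = huffman_code(gather_candidates(sentence, engines, post_passes))
    for translation, code in codes.items():
        if code and secret_bits.startswith(code):
            return translation, secret_bits[len(code):]
    # single-candidate case: nothing can be hidden in this sentence
    return next(iter(codes)), secret_bits
```

The decoder, described next, would rebuild exactly the same tree from the same source sentence and configuration and simply read the code word of the emitted translation back off.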
So: we have a model that tells us how likely these sentences are to be good translations; based on these probabilities we build the Huffman tree, and then, based on the secret message, we select which translation we're going to emit. We do this on a sentence-by-sentence basis, because that's how the translators work. Eventually we either run out of cover source text and cut the secret message short, or we run out of secret message and pad it with random bits.

Now what does the decoder do? Pretty much the same thing. We get the cover translation in, we still build the Huffman tree, and now we essentially go backwards and ask which bits lead to this sentence in the Huffman tree. And voilà, we've got our secret message.

To give you some idea of what the post-passes look like, here are some very simple ones. When we looked at translations, one of the things that is often wrong is the article: English definite versus indefinite, in German the wrong gender, and so on. So that's one thing we can play with. Or some words don't have an article and should have one, and vice versa. That's something we can do. Prepositions are also very tricky: should it be "to", should it be "towards", should it be "from"? So we can play with them a little bit. And again, we manually compiled lists of prepositions that are close to each other and likely to be confused; that's where those variations come from. And of course, if you have a dictionary that lists how often a given word occurs in the language, you can say: okay, this word is one of those rare words that hardly anybody knows, so likely the translator didn't know it either, so we leave it in the original language even if it was translated. Or vice versa: we have it in our dictionary, Babelfish didn't — good for us, we can fix this error. So again, you get two variations just from this little change.

Another post-pass that I like very much is semantic substitution. We talked earlier about synonyms, how hard they are to get, and how you have to construct them by hand. Now, if we have a translation, this gives us a way to construct synonyms in a way that we call second-order semantic substitution. Say we have a German word, say "flach", and the machine originally translated it as "flat". Okay, that's one choice. But the dictionary also offers "even", "smooth", "plain", "shallow", and so on. Now, some of these are semantically rather far away from "flat", so we would not just pick a random one. What can we do? We can translate "flat" back into German and get "flach", "eben" and "glatt" — these are again quite close in German. And if we translate these words back into English — see, it's going back and forth here — then "eben" goes to "even", and so on, so some of these words form a cluster, a cluster that shows they are semantically quite close. And we can do that because we've got these two languages to play against each other. Now, assuming that the original translator took a little bit of context from the current sentence and chose a good translation with "flat", we get good semantic substitutions here. If the original translator chose badly, well, then we're not going to do much additional damage anyway. And whether to use this at all is just another one of those configuration parameters.
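Here is a minimal sketch of that second-order idea, assuming hypothetical bilingual dictionaries `de_to_en` and `en_to_de` (word to set of translations); the toy dictionary entries below are made up purely for the flach/flat example.

```python
def second_order_synonyms(source_word, chosen_translation, de_to_en, en_to_de):
    """Second-order semantic substitution, as described above.

    Keep only those alternative translations of the source word that
    also lie in the round-trip cluster of the translation the engine
    actually chose (translate the choice back, then forward again).
    """
    # all English words the source word could translate to at all
    direct = de_to_en.get(source_word, set())
    # round-trip cluster of the engine's choice: back to German,
    # then each of those words (other than the source word itself) forward again
    cluster = set()
    for back in en_to_de.get(chosen_translation, set()):   # e.g. flach, eben, glatt
        if back != source_word:
            cluster |= de_to_en.get(back, set())            # e.g. even, level, smooth
    # safe substitutes: still valid translations of the source word,
    # and inside the cluster of the chosen translation
    return (direct & cluster) - {chosen_translation}


# toy dictionaries, invented for illustration only
de_to_en = {"flach": {"flat", "even", "smooth", "plain", "shallow"},
            "eben":  {"even", "level", "flat"},
            "glatt": {"smooth", "slick", "even"}}
en_to_de = {"flat": {"flach", "eben", "glatt"}}

print(second_order_synonyms("flach", "flat", de_to_en, en_to_de))
# the cluster keeps {'even', 'smooth'}; 'plain' and 'shallow' are filtered out
```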
And in the tool itself, you can plug in new modules if you feel so inclined; you just have to study machine translation and figure out what is actually a plausible error.

I'm sure you all want to see what it looks like, right? So let's take an original sentence in German; roughly, it says: at this time it is determined whether the students chose the right school and whether it suits their skills. Google gives: "it's determined whether the pupils selected the correct school and whether it corresponds to its abilities." Quite a good translation, actually. Linguatec, another engine that we're using, gives: "whether the pupils have chosen the right school and whether it corresponds to its abilities shall be found out at this time." If you use LIT — and this encodes 8 bits, so that's quite a bit of information, quite a number of choices we had here — you get: "in this time it is toward to be determined whether pupils selected, working with students or students at school, and they have chosen the right school." So it's not so much worse that you could say no translator could have produced it. You might notice the closeness to Google, because in this case LIT selected Google as the basis. But now assume there are a couple of translation packages out there and you don't know which one was chosen; you don't have this comparison directly at hand. Maybe the English is a bit off, but hey, it's hard to tell that there was really something going on.

Let me give you another example; this is from a film review. The original says, roughly: the American film Windhouse tells the story of men from two different generations who travel through Morocco. On the way, they search for what is important to them — the meaning of life. I'm not going to read the Google translation. What LIT gives us is: "the Moroccan film Windhouse tells story from men belonging to two different generations who travel through Morocco. They're looking for the only one which is important to them on the way, the sense of a life." Now, the best translation would leave out the "a". Google and Linguatec put a "the" in there, and LIT actually has, as one of its choices, putting nothing there at all. But the message we encoded here — the string "lit" — happened to select this variant, so we got this one. As you can see, it might even improve the quality of the given translation, and at least it's not necessarily a noticeable difference.

So what are the advantages, again? We can hide information within the limits of machine translation, and as machine translation improves, our system automatically improves in what it generates. We avoid the generation problem by mimicking the results of an imperfect transformation: we don't have to produce correct text, we can get away with producing slightly incorrect text. What is also interesting: if you have a larger document with some rhetorical structure, we keep it — the translation keeps the rhetorical structure. With any of the other generation systems, well, if you're lucky you get one sentence right, but you're never going to get a large book or article right that way in terms of rhetorical structure and so on. We have a secret key that can be used to configure the system, and it can be very small — you can just say, okay, I'm using Babelfish and Google and second-order semantic substitution — or you can make it more complicated: you can say, I've got this machine translation system and it was trained on this corpus. Machine translation is statistical.
So machine translators are typically built from some large parallel corpus of sentences in English and in German, and the machine translation system is supposed to learn from that what a good translation is. If you train on a different corpus, you get a different machine translator. So you can make this entire corpus part of your secret if you want to, and you get lots of possible variation there. And again, for now, the cover text can come from a public source; you don't have to write a new one each time, or take a new image, or whatever else you used to do for the cover.

What are the disadvantages? Well, the bit rate is still not very high. One of the big problems is that text simply doesn't have much room in it compared to, say, binary data like images. And as I said multiple times before, we need to transmit both the source text, or at least some reference to it, and the translation for the decoder to work. Let's try to address these two issues.

So what can we do to increase the bit rate? We can add more machine translation systems — I'm sure the people who are doing machine translation will build them for us anyway, but that's one way. We can, as I said, create new corpora to train existing machine translation implementations, and that's actually quite easy: if I've got a parallel corpus of 20,000 words, I can select any subset of this corpus and get another corpus. So that's easy, as long as I have the full source of a machine translation system and can train it myself. And of course, you can write additional post-passes or pre-passes that make plausible modifications to sentences. What we have so far is a prototype implementation, which of course means we have limited dictionaries — we did not write anything that would give us knowledge about the grammar or semantics of a language — and we only have a couple of translation engines available to us. Still, what we get is a bit rate of something like 0.0082 bits per bit transmitted, and that's measured against normal, uncompressed text. If I gzip the text first, the relative rate works out to something like 0.022, which is a more honest comparison. If you take a PNG or JPEG image and hide something in it, the same argument applies: you should really compare against the compressed data.

Okay. Quickly, a minor variation that will become interesting soon: we can use the scheme not only for steganography, but also for watermarking. Watermarking and steganographic systems are often related, so this is quite natural to do. In watermarking, we want to be able to read some embedded mark from just the translation; we don't want to need the original text. So what can we do here? We can take a keyed hash of a given sentence and use its least significant bit as our mark bit. Essentially, when we pick our sentences for the translation, we go through all the variations, compute the keyed hashes, and keep going until we find a variation that actually gives us the mark bit. Since we have lots of variations, we will statistically succeed most of the time. And of course, we use multiple sentences if we need multiple mark bits. Which sentences are we going to pick? Well, we could have some other secret mechanism that gives us sentences 5, 10, 17, or we could just start at the beginning. But this makes the watermark kind of fragile: the adversary could just insert a sentence at the beginning, and suddenly all of our offsets are wrong.
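Before getting to a better way of choosing which sentences carry the mark, here is a minimal sketch of the per-sentence marking step itself. The helper names are made up for the illustration, the shared key is assumed to be a byte string, and HMAC-SHA256 merely stands in for whatever keyed hash the real system uses.

```python
import hashlib
import hmac


def hash_bit(key, sentence):
    """Least significant bit of a keyed hash of one sentence.
    HMAC-SHA256 is just a placeholder keyed hash for this sketch."""
    mac = hmac.new(key, sentence.encode("utf-8"), hashlib.sha256)
    return mac.digest()[-1] & 1


def pick_marked_variation(variations, mark_bit, key):
    """Among the candidate translations of one sentence, pick one whose
    keyed-hash LSB equals the desired mark bit.  With several candidates
    this succeeds with high probability; None means this sentence cannot
    carry the bit (rare, and redundancy across sentences can absorb it)."""
    for candidate in variations:
        if hash_bit(key, candidate) == mark_bit:
            return candidate
    return None


def read_mark(marked_sentences, key):
    """Extraction needs only the translation: recompute the keyed-hash
    LSB of each sentence that was chosen to carry a mark bit."""
    return [hash_bit(key, s) for s in marked_sentences]
```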
A better technique is to have some arbitrary mechanism that ranks all of the sentences in the translation; that just gives us some total ordering. We take the sentences that score lowest in this ordering, and the sentences right after them in the document are the ones that hold the mark bits. This way, an insertion or deletion attack will not necessarily succeed in destroying the mark.

It wouldn't be DEF CON if we didn't have some attacks, right? So what can we do about attacks? First of all, there's the possibility of somebody spotting obvious inconsistencies. The easiest one that comes to mind: say you've got a translation where the original text contains the same sentence twice, and it's translated in two different ways. A machine translator would never do that — it's a deterministic machine, and now suddenly it isn't. In some sense that's a simple bug: in the LIT system you just have to keep track of which sentences you have translated so far, and if you see the same sentence again, well, you don't use it to encode anything; you emit the sentence you emitted the first time. Similarly, there are certain mistakes that machine translators make, like translating the plural of "foot" as "foots" — it should be "feet", obviously. If you pick two different machine translators where one makes this mistake and the other one doesn't, and you alternate between them, you get inconsistent mistakes in your translation, and the reader will wonder. A human might make this mistake once and not another time, but a machine translator should make its mistakes consistently. So that's another possible source of detection, which you would fix by writing some kind of post-pass or processing step that makes sure these things don't happen, if you can spot them.

In general — and this is a problem most steganographic systems have — somebody could construct a new statistical model that says: all of the translation systems in the world obey this model, except for this one. The same applies to image steganography or to audio files: somebody has a model, and you deviate from it when you shouldn't. How can you defeat that? Let's assume, just for fairness, that we also know about that model. If we know about it, we can easily make our encoder aware of it by filtering out, during the translation process, sentences that don't obey the model. And then we can't be detected by it anymore. What is interesting is that creating such a model is in some sense equivalent to improving statistical machine translation, because, well, what is a machine translator? It is a model of how a sentence in one language corresponds to a sentence in another. In the end, it's an arms race: whoever has the better model wins, and if the attacker and the defender have models of the same quality, the defender wins, because the attacker will have to concede that the output is plausible.

Okay. Finally, the most fun part: how can we avoid transmitting the original document at all? Remember the least-significant-bit idea from watermarking. Now, what if we take the first h bits of the keyed hash of each sentence to tell us how many of the following least significant bits of the hash actually contain message bits? So the sender and receiver agree on some small constant h. When I pick my sentences, I try to pick one where the first two bits, say, encode the number one; that means I'm going to transmit one bit, and then the third bit of the hash is the actual message bit.
Or, if I have more variations available for a given sentence, I could transmit one where the first two bits are both set, meaning the number three, and then the following three bits of the hash carry message bits. This is a statistical trick, based simply on the fact that hashes are pseudo-random, that lets me encode the message while transmitting only the translation. I might fail to find a variation for some sentence that lets me encode what I want, but statistically speaking I'm unlikely to fail, depending on my choice of h and on how many translations I've got available for a given sentence. And since I have to assume that I will fail eventually, we just add forward error correction to the scheme, and then we can correct the errors that are statistically bound to occur.

Conclusions: translation-based steganography is, in my opinion, a promising new approach to steganography. The bit rate we can achieve is lower than that of systems operating on binary data, but we have the advantages I mentioned, and we can defeat statistical attacks if the statistical model the attacker is using is known to us. Questions?

Can't have no questions. I'll give the slides to the DEF CON goons, and I can put them on the website, obviously. There's also a paper at the Information Hiding Workshop 2005 which describes the scheme, except for this last bit with the forward error correction.

Okay, so the question was whether we intend to add support for languages that don't use the Roman character set. Well, the system itself uses Unicode, so if you give us a translation engine that translates English to Chinese, you can use it. The problem is that once you move to languages that are no longer Indo-European, the distance between the languages is so big that existing machine translation systems just can't cope anymore. You get total garbage, and it's no longer plausible that somebody would send someone else a translation of a Chinese document into English that is total garbage. The system would still work, but you lose the plausibility, because you need a certain level of correctness from the machine translator to make it plausible.

Okay, how do we detect tampering with the message? Well, we don't. This is a steganographic system; detecting tampering is typically not the goal. You could, of course, just add a hash at the end or sign the resulting message, but the idea is that the adversary doesn't even know that you're using it, so an adversary actively disrupting your communication is completely outside the goals of steganography.

No, it does not use symmetric cryptography — there's no cryptography involved here at all. The shared key, we assume, is transmitted out of band, up front, by some means; we don't care, we just assume it exists and the two parties have it at the beginning. They have to know that they're going to use steganography anyway, otherwise the receiver will probably wonder why he's getting this translation. No, we don't deal with public key infrastructure or anything like that.

No more questions? Okay, well, I originally told the DEF CON goons that I would give a half-hour talk and they gave me a one-hour slot. So what I can do very quickly is talk a little bit about another project of mine that the DEF CON goons didn't want as a talk. I don't know if you've heard about GNUnet. It's a GNU peer-to-peer framework.
Just like Freenet, it tries to do anonymous file sharing. It runs on Linux, BSD, Solaris, and so on; it's written in C, so portability is not a given. The goals, as I said, are anonymity and confidentiality — all of the traffic is encrypted — and deniability: even if I break into your computer, you should be able to say, I didn't have these files on my machine. We try to make it economically sensible, so that people have incentives for sharing, and of course it should be efficient. Very quickly, the upcoming 0.7 release — hopefully this year — will have a very nice plugin architecture. We can have transports that send either datagrams, like UDP, or streams. We have improved resource accounting, and, as Ian Clarke said earlier, you can have a friend-to-friend or a peer-to-peer topology; you have the choice. Cryptography-wise, we use AES, SHA and RSA. And if you want to look at the nice new features in file sharing, look at the cryptography behind KBlocks. We can associate arbitrary metadata with files, so you can add descriptions to your search results, or in particular previews, so you can see some preview of the contents before you actually download the file. We get quite a big speedup from using 32K blocks, and we have a nice, shiny, new Glade-based user interface. We're currently at pre5; I hope the next release will be 0.7.0. And if you care to learn more about GNUnet, here is the information. I see us as a friendly competitor to Freenet, which you heard about just before.

Questions? Right, yeah, that was one of the things: can you use steganography with GNUnet? Theoretically, yes. We've got this transport layer abstraction which essentially says: if you can send smoke signals to the other guy, we can transfer messages over it. The problem is, look at the bit rate again — if you want to do peer-to-peer with that kind of bit rate, you have a first problem. The second problem is that the steganographic encoder is quite slow, so not only would you use extreme amounts of bandwidth, but your CPU is also just going to crash and burn. And the third problem is that you need the shared secret and these machine translators in the background, and usability-wise we want to go for a slightly larger user base than the set of people who are actually capable of setting this thing up. Oh, and LIT is written in PHP and GNUnet is in C, so there would be some bridging code that is also non-trivial. So theoretically yes, in practice no. And besides, why would you want to do anonymous file sharing when with steganography nobody even knows that you're transmitting anything? No more questions? Okay, well, if you want to see me in private, I'll be somewhere outside.