I'm from the University of Aachen. This will not be a deep technical talk; perhaps you might consider that disappointing. But if you want to parse some kernel structures in your mind, you can come to me afterwards and we can do that over a beer. This is about hidden data in document formats. Many people assume I'm talking about steganography when I mention that, but I'm not interested in putting things in. I'm more interested in other people and other programs putting things in without me knowing. Basically, I'm at the university, and we have a lot of people working in theoretical security. They discuss covert channels and how hard they are, and look a lot at that from the theoretical perspective, at the maximum bandwidth a covert channel has. I wanted to look at it from a practical perspective and see how much data is out there on the Internet that we don't know about when we publish documents. Just a small demonstration: if I create a simple Word document, save it, and then look at it with the strings command, I see there's a lot of stuff in it already. For example, the text I've written in the document, twice; my full name; the name of the format I used; the name of the Word version I used; an email address, or something which looks like one, and I have no idea what that means; some text strings. Pardon? I can try. So, basically no, I can't make it bigger, it didn't work. Basically, I ran strings on a Word document, and there's a lot of stuff in it. That was just a teaser, so no really important information in there. OK, and the problem is complex data formats we don't understand, and are not supposed to understand. The people creating the programs which write these data formats basically have a "just trust us, buy our products, use them and don't care" attitude.
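What the strings demo does is easy to reproduce. Here is a minimal Python sketch of the technique the Unix strings command uses (the sample bytes are made up; a real Word file would be the input):

```python
import re

def ascii_strings(data: bytes, min_len: int = 4):
    """Return runs of printable ASCII characters, like the Unix strings tool."""
    return [m.group().decode("ascii")
            for m in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data)]

# A tiny stand-in for a Word file: readable text mixed with binary bytes
# (the leading bytes are the OLE magic number found in real .doc files)
blob = b"\xd0\xcf\x11\xe0Some document text\x00\x01Author: Joe Smith\x02"
print(ascii_strings(blob))   # ['Some document text', 'Author: Joe Smith']
```

Anything a program quietly writes into the file as plain text, such as author names or template paths, falls out of a scan like this.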
And even if people publish the structure of their document format, we are usually not willing to understand it. If you try to figure out how Microsoft Office documents are structured internally, you will probably go mad. So, which tools did I use to look at the problematic documents? Mainly I looked at Word documents, but into some other documents too. At first, I tried Word document converters, which are meant to convert MS Office documents over to text or something like that: antiword, catdoc, and word2x. Some of them are not very well maintained, others are very well maintained. But basically, that didn't work out that well. My first idea was to convert the Word document to text, then use the strings command to find other strings in the Word document and see which weren't covered by the conversion process. That was cumbersome and no fun. I then tried to look more into the metadata. There's something out there called LAOLA, which is a collection of documentation and Perl programs dealing with binary file formats of Windows programs. It has several programs in there, for example something to get the trash sections out of Word documents (I didn't know there were any in there), to get the metadata out of a Word document, the structure, and even password revealing. The problem with LAOLA is that it was last touched five years ago, and Microsoft Office documents have evolved since then. And it was written in Perl, which I consider no fun to hack on. Since this was a fun project, I went on looking for other things. If somebody really tries to dig deep into the Office document formats, he probably should check LAOLA, because it seems to have a very sophisticated understanding of the OLE streams inside a Word document. Then there's wvWare, which is actually used by AbiWord, and KWord at least was trying to use wvWare as a library to read Microsoft Office documents.
The development is somewhat hard to understand. I really wasn't able to figure out which is the latest version, who is developing what, and what the latest name is. Basically, I just used the Debian default package to test it. It has tools to convert Word documents to text, HTML, and some other formats, and it has tools to read out the meta information in Word documents, wvSummary and wvVersion. Just a small example of them. Is it big enough for you? So it's completely unreadable. That sucks. The problem is that I didn't take the trouble to port it over to Mac OS; I did this test on a Unix machine, and so I can't show it to you now. Basically, it can show you the version of the Word document, the summaries, the subject, the author information, all that you put into the metadata of the Word document. If there are comments in there, the template used, the Microsoft Word version, the number of pages, the number of words, the number of characters, whether it's encrypted, and which code page is used. Oh, that really sucks. The programs in there include converting to DVI, HTML, LaTeX, PDF, RTF, getting this meta information, text, and so on. And it actually works quite well. It's probably something like viewing web pages with lynx or links: they don't look great, but at least you learn what it's all about. Then there's a very interesting tool, Word Dumper, which was done by Richard Smith of Computer Bytes Man. He once had it published on his homepage, but now he hasn't, and even in the Wayback Machine there's nothing to find about it; Google didn't have it either. But if you write to Richard Smith at the computerbytesman.com domain, he is willing to give it out; at least he gave it to me. Before continuing, I'll try to get the screen resolution somewhat better. Is that more readable for you? Not really. Sorry about that. So now we will browse through some examples of hidden data in documents, not only Office formats.
You are very much invited to shout out if you know of other examples, because I think these are mostly things we just stumble across by chance. It's very hard to systematically research this hidden data in document formats. One very obvious place are mail and news headers. I think the average nerd knows about them, but nobody else does, and many programs put a lot of stuff in there: the program version, the news host used, the mail host used, and so on, IP addresses of the machine connecting to the news server. On Usenet, when I last used it, this was very useful for flaming people because they were using Outlook, and those people were completely astonished at how the flamers could know they were using Outlook, because there was an additional header in there. It might be embarrassing if you sell one kind of news or mail software and there's a header from another software in your postings, because you used that other software. A very interesting incident was the biggest German ISP, T-Online, which put your customer number into the news headers for abuse tracking or something like that. And the fun thing was, in the mid-90s your customer number contained your telephone number. So everybody who was able to show the headers of your news postings got your telephone number. A year ago or something like that, they stumbled into that again. The German government passed a law that every home page needs an imprint. And T-Online, hosting millions of home pages, created an automatic imprint service for the home pages hosted there. Or not really a service: practically, an imprint was put into your home page without you knowing or being asked about it. But the home pages could also be addressed by the customer number, which wasn't identical to the telephone number anymore. So say you saw a posting on Usenet and you wanted to flame the guy or something like that.
You just went to the home page server of T-Online, looked for the home page related to the customer number by just adding the customer number after homepages.t-online.de, and then went for the imprint file, which was always called, if I remember correctly, _imprint.html. And then you had the name, the address, and the telephone number of the person who made that posting. Again, people were not aware of that and were very, very astonished when they wrote something thinking they were anonymous, and then somebody else posted a reply with the address, telephone number, and name of the poster. Config files: we might not consider config files real documents, but at least it's something which bites us all the time too. The average software has such complex config files, at least Unix software, that you really can't just skim over them; you have difficulties understanding what all the options are doing and which options are missing from the default config file. This can result in security issues through misconfiguration, but also in disclosure of information. The prime example is Apache, which in the default configuration is really telling every attacker, or whoever else, in the HTTP headers which version it is, on which operating system it runs, and which modules are in there. You can change that with just one configuration directive, but you have to know it. More fun probably is BitchX, which in the default configuration, when you leave a channel, posts to the channel before you leave: "this guy is too stupid to configure his BitchX" or something like that. HTML: complex programs generate complex HTML. For example, if you look into a file generated by Microsoft Office, it's nearly unreadable for the average guy who knows just some HTML.
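For the Apache example, the directives involved are real ones; a minimal sketch of the relevant httpd.conf lines (assuming a reasonably modern Apache) would be:

```apache
# Default is "Full": the Server header advertises version, OS and modules.
# "Prod" reduces it to just "Apache".
ServerTokens Prod

# Also stop Apache from appending version details to error pages and listings.
ServerSignature Off
```

Note that this only hides the banner; it does not change what version is actually running.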
But even less complex programs and less complex HTML often contain information not really needed in there, for example the meta generator tag, which, to my knowledge, has several times bitten web design companies that were using other tools than the ones they claimed to be using. And surprisingly often there are still paths to local files in HTML pages. People usually have these local files in their personal folder on Windows, and the folder is named "Joe Smith's files", so you can deduce the author of the file. Also, in HTML files there are often comments left by the developers, or content commented out because it was considered inappropriate or something like that. I have found comments like "this really sucks, we have to redesign it", but most often the comments just structure the file, which is very nice if you are trying to grab content from an HTML page, of a news site or so. If there are comments in there like "news item starts here, news item stops here, news item starts here, news item stops here", it's very, very easy to grab the content of that page. One of the really fun things were defaced web pages. I think three years ago people from attrition.org spoke here about their experiences with their defacement mirror. I thought it was one case, but yesterday I asked them, and they said in one in 500 defacements the defaced page, the "this page was rooted by some leet haxor" page, contained links to files on the local hard disk of the defacer, and often there was the name of the defacer in there. Yeah, and then there's PDF. PDF looks somewhat open, the standard is documented and so on, but if you have ever tried to write a PDF parser, it looks much less open, and Adobe is extending it all the time, putting the latest and greatest stuff into their own applications and only starting to document it after that.
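Those structuring comments really do make scraping trivial. A toy sketch, with the comment texts modeled on the example from the talk (the page content is made up):

```python
import re

# A made-up page fragment with the kind of structuring comments developers leave in
html = """
<html><body>
<!-- news item starts here --><p>First story</p><!-- news item stops here -->
<!-- news item starts here --><p>Second story</p><!-- news item stops here -->
</body></html>
"""

# Grab everything between each start/stop comment pair
items = re.findall(
    r"<!-- news item starts here -->(.*?)<!-- news item stops here -->",
    html, re.DOTALL)
print(items)   # ['<p>First story</p>', '<p>Second story</p>']
```

The same comments that help the site's own developers navigate the file hand a scraper its delimiters for free.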
One of the main problems is censorship or redaction: if people use PDF files to black out things you shouldn't know, that usually doesn't work. Incidents with PDF files: there's a very famous sniper letter from the Washington shootings. The sniper left letters, and the Washington Post published one of them and blackened out all the information that wasn't for the public, for example the number of the bank account where the money should be sent. And it was very easy to just remove the black bars and read the number of the bank account and so on. Then there was a report on the labor conditions in the Justice Department, and again they blackened out quite a lot, and it was very easy to remove. And very famous: under the Freedom of Information Act, people had gained access to a paper describing the actions of the CIA in Iran in the '50s, and the document was blackened in various places, and this could be removed again. So how can we exploit that in PDF? In many instances it turns out that the data was hidden by setting the background color of the text to black while the text itself was black. We can still mark the text with Adobe Acrobat Standard and copy it. If the text is a graphic, because it's scanned in, we can copy it and put it somewhere else, or if the black bars are overlaying graphics, we can just cut them. People usually think PDF documents are not that editable, at least if you are from the Linux world, but if you are willing to shell out all the money for the whole Adobe creative suite, you can basically do every manipulation on a PDF file, and PDF is just another vector graphics format for you. So what you can do with PDF files really boils down to how much money you are willing to shell out to Adobe. A demonstration of that.
So this was a diversity analysis by KPMG for the Department of Justice, and let's see, here's some blacked-out text. If I now go for "select text" and just select it (that's a little bit difficult here because the text is also a link), copy, and use a random text editor... no, not this one. And it has lost the formatting, but okay. The next demonstration: we'll recover the image under the black bars. That is the sniper letter, again with just Acrobat Standard. I think this is the original version as it was put up by the Washington Post. And you see, with "select text" I can select this text, but not the other stuff, because it's a graphic. So I go for "select image", select that, copy, go to another program, new graphic from clipboard, and there it is. I wasn't able to remove the black bars from a document with Acrobat, but with Adobe Illustrator this is really no problem. Okay, since Illustrator is not for layout, I can only open a single page. I'm pretty sure that with InDesign I could work with multi-page PDF documents, but I don't have Adobe InDesign, and I think it would be wrong to steal it. You can obviously see there's a scanned text, and the black bars are vector graphics, because they are so sharp. I can mark them, move them around, and kill them. So I was really astonished about PDF, and by far I had the most fun with PDF documents. Sorry, dealing with MS Office is not that much fun. The Office document format, as I said above, is incredibly complex and undocumented. I think they documented the version of Office 2000 or so, but it's a terrible mess and ever changing, and it's well known that it's full of unwanted data, but I think nobody knows exactly what's in there.
I think the first time there was a public discussion was in 1997, when it turned out that in every Office document there's a globally unique ID, a GUID, of the author, and this GUID also contained the MAC address of the machine creating the document. This was one of the factors which helped to find the author of the Melissa virus, because it was a Word macro and therefore a Word document, and I think it was also used in convicting him. Microsoft found out that this probably is not a good thing, or at least they bowed to the public pressure, and gave out a patch to remove this GUID from documents, and also a patch for Office 97 so it wouldn't put it in again. So there's news from CNET: Microsoft admits privacy problems and plans to fix them. Shortly after that, Richard Smith found out that there's also a history of the last 10 file names the document was saved under. He looked into Microsoft's annual report and found out that it was created on a Mac, because Macs use colons to separate path components and Windows machines use backslashes. You can be pretty sure, if you see a Mac-style file name with a G3 and a desktop folder in it, that it really was a Mac. A nice thing to note, if you can read it in one of the lines in the middle: the Microsoft annual report was also saved in Temporary Items as an auto-recovery file. So probably, while they were editing it, their own program crashed. Okay, but Microsoft plans to fix the problems. So for Word 97, there's a Knowledge Base article on how to minimize metadata in Word documents. That's good. Oh, they haven't fixed it in Word 2000: how to minimize metadata in Word 2000 documents. But the future is there, the next millennium: how to minimize metadata in Microsoft Word 2002 documents. And you saw it coming: how to minimize metadata in Word 2003 documents. And the Office suite for the Mac is now Office 2004; I think there's no Windows version called 2004. I wrote to them, and I hope there will shortly be a Knowledge Base article on how to minimize metadata in Word 2004. So what's in there?
We don't really know, because the document format is not well documented. But if you look into the "how to minimize metadata" document, there's: how to remove your username from your documents; how to remove personal summary information; how to remove personal summary information when connected to a network; how to remove comments; how to remove headers and footers; how to remove revision marks; how to turn off Fast Save; how to find and remove text formatted as hidden; how to remove hyperlinks from documents; how to remove styles from documents; how to remove old file versions from documents (so they are in the document too, in the same file); how to remove links to files that are specific to the machine the document is edited on; how to remove the template name and location; how to remove routing slip information; how to remove the names of previous authors; how to remove your name from Visual Basic code (it's in there too); how to remove Visual Basic references to other files; how to remove network or hard disk information; embedded objects in documents may contain metadata; document variables may contain metadata; and general suggestions about security. That sounds like a whole lot of stuff in a Word document. I'm very unhappy I wasn't able to find all of that, because I used only very crude tools to analyze Word documents. Some famous incidents with Word documents: probably the most famous, at least in Europe, is the UK Iraq dossier. Then there was a very interesting thing in Germany, where it basically came out that they made up the numbers for a huge planning project. And the thing with the Melissa virus I've already told you: they caught the author of the Melissa virus by looking for the GUID in the original posting of the virus. This is the Iraq dossier.
Basically, in Great Britain, the British government published a dossier saying that Iraq is so dangerous and has weapons of mass destruction and is going to bomb the whole world next week if they don't start a war. You probably know the story. But the document they put on the web contained this autosave information, and the path of the file hinted very strongly at the names of four people. And so it turned out that the document was basically a spiced-up version of a more or less internship project of somebody in a ministry there. This thing is still going on, and I think it is at least nearly costing the British... oh, what is he, president? No. Tony Blair, he's Prime Minister, exactly. It is nearly costing him his chair. The other document was in Germany, where there are several huge projects underway. For example, the Germans have this very nifty magnetically levitated high-speed train, but nobody wants to buy it. So they want to use it themselves, eat their own dog food. And there was a study saying: oh, if we build this thing into the Ruhr area, we will make shitloads of money with it. They published a document on that, and basically, if you went back in the edit history, you saw that they added a zero to all the income figures and removed a zero from all the costs. Other documented incidents include text from a completely unrelated document turning up in another document. Somebody (this was posted to the RISKS digest) edited a document, saved it, closed the file, created a new file, and wrote something in it. If you looked at the new file with strings or something like that, you saw the text of the first document in there. And there is this edit history: data deleted from a document, or overwritten, still appears in the file. So on to some other formats, which look so nice and so unproblematic: image formats usually contain at least a comment field.
And these comment fields are not used that much, so most imaging software puts its own name in there: "created with the GIMP", for example. And that might bite you in the butt too, if you are selling some imaging software and there's a "created by something else" comment in your images. And JPEG itself has an extensible header format for adding metadata like camera type, exposure and color calibration information, and thumbnails. There was a very remarkable incident with EXIF thumbnails in JPEG documents, which is quite juicy. Some moderator put a picture of herself on her web log, and the picture was created by cropping the original image. The problem was the file still contained the thumbnail, and the thumbnail still contained the uncropped image. I really wouldn't have thought of checking my thumbnails before putting out my images. Other things in document formats: I think very famous was the Starr Report, which was created in WordPerfect and then converted several times. In the Starr Report as first published, several footnotes turned up which had been deleted before and came back in the conversion process, and they were along the lines of "this dumbass is probably lying" or something like that. Very, very many document formats embed serial numbers, some sort of serial number of the software you used to create the document, GUIDs again. Then there is sometimes information in the document, especially with shareware, that the software you were using is unregistered. That may be in a mail header or in a footer where you see it, but in complex document formats it might be anywhere; you really don't have any idea if you don't know the document format. Another example: Adobe InDesign puts a list of all the fonts you have installed on your machine into the documents. And if you are a print shop or something like that, you're probably very heavily into font warez trading.
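Coming back to the EXIF thumbnail incident: the thumbnail is stored as a complete little JPEG embedded inside the main file, so a very crude way to fish it out is to look for a second JPEG start-of-image marker. This is only a heuristic sketch (real EXIF parsing walks the IFD structures, and compressed data can contain marker-like byte sequences); the sample bytes here are synthetic:

```python
def find_embedded_jpeg(data: bytes):
    """Crude heuristic: an EXIF thumbnail is a complete JPEG embedded in the
    file, so look for a second start-of-image marker after the outer one."""
    soi = data.find(b"\xff\xd8\xff", 3)      # skip the outer SOI at offset 0
    if soi == -1:
        return None
    eoi = data.find(b"\xff\xd9", soi)        # first end-of-image after it
    if eoi == -1:
        return None
    return data[soi:eoi + 2]

# Synthetic stand-in: an outer "JPEG" wrapping an inner thumbnail "JPEG"
inner = b"\xff\xd8\xff\xe0fake thumbnail data\xff\xd9"
outer = b"\xff\xd8\xff\xe1EXIF header stuff" + inner + b"more image data\xff\xd9"
print(find_embedded_jpeg(outer) == inner)   # True
```

A proper tool parses the APP1 segment instead of scanning for markers, but even this crude scan shows why a cropped picture can still carry its uncropped past around.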
And having a list of all your warez in every document you give out is a problem. So I really wanted to know what's actually out there. I think writing a document in Word, saving it, and looking at what's in there is not that interesting, and I don't use Word that much. So I went out to the internet and tried to get as many Office documents as I could. I crawled the web; I have a tendency toward writing web crawlers. So basically, I was like the director of the first Blues Brothers film: yeah, I make this film with these guys, and then I have a car chase, it would be so great. And it's car chase and car chase and car chase, and then the other guys, and then, again, a car chase. For me: oh, I'll make something about documents, and then I write a web crawler. It would be so cool and so fast, not multi-threaded but non-blocking, and on I go. Okay, so I spent most of my time writing the web crawler. You can get the recent version from my homepage. Originally I wanted to hack on all these Word document parsing tools, but there was no time, because I had to spend all the time writing the web crawler. And my office machine now has a very juicy internet connection in the US; I can really do five megabits to the internet without a problem. So writing a web crawler was really fun this time; before, I always had to do it from my home DSL line. Okay, so: download the documents, have a short look at them, save them, and put them on a DVD for further research. That was at least what I did; it was not what I had in mind when I started the project. The real problem is: how do I detect the data in Word documents? The strings command is a help, but with Unicode it doesn't work that nicely, and if there's some binary or compressed data in there, for example that thumbnail, I have no way of finding the stuff without knowing something about the document format.
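About the Unicode problem: Word stores a lot of its text as UTF-16, where every other byte is a NUL, so plain strings misses it. A hedged little sketch of hunting for UTF-16LE text instead (some strings implementations have an option for this, but it's easy to do by hand):

```python
import re

def utf16le_strings(data: bytes, min_len: int = 4):
    """Find runs of ASCII-range characters encoded as UTF-16LE,
    i.e. the pattern: printable byte, NUL, printable byte, NUL, ..."""
    pattern = rb"(?:[\x20-\x7e]\x00){%d,}" % min_len
    return [m.group().decode("utf-16-le")
            for m in re.finditer(pattern, data)]

# "hidden author" as Word would store it, surrounded by binary noise
blob = "hidden author".encode("utf-16-le") + b"\x01\x02" + b"plain ascii"
print(utf16le_strings(blob))   # ['hidden author']
```

This still doesn't help with compressed or binary payloads like the embedded thumbnail; for those you really do need to know the format.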
While I was writing my web crawler, some other guy published a paper on this in IEEE Security & Privacy, with some interesting results. You can Google the title of the article and find it on his homepage. He refrained from writing a web crawler and just used Google to find the documents, so he had some time to actually look at them. But when looking at the documents, he took a different approach, so probably I still have something to contribute. Crawling the web fast is fun and hard; I think I said that already. My actual crawler used for this is called Lens 3. The first one, Lens 1, was actually a crawler for Gopher space, so it goes far back. And I used it for directed crawling, so really hunting for Word documents and not trying to get the whole internet. You can get it at the address down there, at 23.nu; I think it is there. I wasn't able to check, because the internet is somewhat difficult here. I think the feds are tapping all the lines, and that's why it didn't work; I think hackers wouldn't break the internet. When I found out that I was spending too much time writing my own crawler, I tried another approach: Niels Provos' crawl, which does not do directed crawling. But if you want to do fast and simple downloading of Word documents and don't care about bandwidth and nice code, just grab his crawl and apply a patch, again at c0re.23.nu. Crawl was originally built to spider for JPEG images, not because Niels likes porn so much, but because he was looking for steganographic data in there. Yeah, question? That's a zero, for sure; that's a leet group. You know, there are these Core S.A. people, who are obviously not that leet, because they don't have a zero in there. And at least Niels Provos' thing was fast, if not efficient. All in all, I think I got about 150,000 documents, which are now lying on my hard disk while I see what to do with them.
I used the result pages of search engines to feed my own crawler and then the crawl crawler. So I used random words plus "document type: Word document" in Google and other search engines, saved the result pages, and used the URLs in there to start the crawlers. I saved all documents with the MIME types you see there. It's very interesting: there seems to be no standard for the MIME type of MS Word documents. And even if there was no Microsoft Word or Office specific MIME type, I saved the files if they were named .doc, .ppt or .xls. With the PowerPoint files, I really hope to analyze them later for speaker notes, hoping to find some juicy details in there. So I collected, as I said before, 150,000 Word documents, and basically I just ran all the document converters and document information finders in batch mode on them, saved the results to a file, and skimmed them by hand and by eye. While writing a crawler, which is a very interesting thing in itself, I saw unbelievable misconfigurations of web servers. I think the funniest one was status code 226, which makes no sense in HTTP, but does in SMTP. So I had seen a web server thinking it was a mail server. Can you read that? Oh, that sucked. Okay, I used the tool by Richard Smith, Word Dump, which he sent to me. And this is a document by a newer MS Office which contains very, very little information, or at least Word Dump, which stopped being developed in '99 I think, can't parse it. For example, to go back: Word Dump reports this document as created by Word 97, build date of the file 8th of April, 2003. Some other document was created by MS Office, and it still contains this revision log, the last 10 file names. This is an example of a document which has the GUID in it; it was created by Word 97, or at least a Word version built in 1998. The GUID is the thing in the fifth-to-last line or so.
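The seeding step can be sketched very simply: pull the URLs out of a saved search-result page and keep the ones that look like Office documents. The HTML here is made up, just to show the idea:

```python
import re

# A made-up stand-in for a saved search-engine result page
saved_result_page = """
<a href="http://example.org/report.doc">report</a>
<a href="http://example.org/slides.ppt">slides</a>
<a href="http://example.org/index.html">home</a>
<a href="http://example.org/budget.xls">budget</a>
"""

# Extract all link targets, then keep likely Office documents by extension
urls = re.findall(r'href="([^"]+)"', saved_result_page)
seeds = [u for u in urls if u.lower().endswith((".doc", ".ppt", ".xls"))]
print(seeds)
```

In practice you would also trust the MIME type the server reports, since, as noted above, file extensions and MIME types for Office documents are wildly inconsistent.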
So this is a very interesting document, because of the ten times it was saved, five times it was saved under the name AutoRecovery. So probably the guy editing this document has a machine which is crashing very, very often. If this is from a computer shop, you probably don't want to buy your computer there. This is a very interesting document I found on the website of the OECD, which is an international organization. I never really looked into the actual document, but it seemed to be the minutes of a session they had, and it had 71 revisions. So while I only see the last 10 names, the Word document at least stores the number of times it was edited. This document was edited 71 times, and you wonder: the minutes of a meeting, edited so often? Did they change them afterwards? It's very strange. Oh, here we have again a case of many auto-recoveries. I think that was it with the Word examples. I had another one which was very nice: the document started being edited on a Mac under the name "Customer Files Letterhead One", and then the names iterated until it was "information brochure" or something like that on a Windows machine. So you could obviously see the whole odyssey from the graphic designer to the people using the document afterwards. So, the conclusions: you never know what is there in proprietary file formats. For example, the Microsoft Knowledge Base article had so many hints about further information in Word documents and how to remove it, and I wasn't able to find it all. So if you have some time on your hands, try to parse all these proprietary document formats, write parsers for them, and spider the internet; I can send you a DVD with 100,000 to 150,000 documents, and you can see what's in there. Open formats are only part of the solution. For example, Keynote, with which this presentation is done, saves its presentations as XML files.
But this XML is so complex you really don't know what's in there, at least if you don't spend your whole time learning the format. And spider the web, look for interesting files on other people's pages, and enjoy what you find there. So, are there questions? Yeah. Auto-recovery? Yeah, sure; not only the file name but the whole path, which usually contains the server name and so on. Yeah, you're right. If you want, I still have a bonus track, and I think I have another five minutes; the goons are far away, they need at least five minutes. Okay. At least for PDF, there is already special software to scrub documents. I forgot the name of the software, but I found a demonstration on the internet. You have a PDF and you mark what you want to be gone, and then the software removes it. The software not only puts black boxes there, but removes the real letters and exchanges them for minus signs (and perhaps in the image too, I don't know). If you want to go a step further: it should perhaps be mentioned that Microsoft is now also offering a plug-in to scrub information from MS Office documents, but who knows what it scrubs and what not; still, you can at least try to use it. Going a step further again, people have long discussed whether there is other hidden information in documents just in the way you write: the way you build your sentences, the kind of words you use. This was discussed a lot in the cypherpunk community: whether you can find the author of an anonymous posting by just machine-analyzing the posting and comparing it with others. And Dave Aitel did a program, umask. I think it's a very bad name, naming a program like a Unix shell command, but this is to my knowledge the first implementation trying to do that: you have a group of text files you say are from the same author, and umask can generate a fingerprint of them, and then you can compare that fingerprint with an unknown file.
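The idea behind such a fingerprint can be sketched very roughly. This is not Aitel's actual algorithm, just a toy word-frequency comparison to illustrate the principle of matching an unknown text against a known author's profile:

```python
from collections import Counter
import math

def fingerprint(texts):
    """Combine several texts by one author into a word-frequency profile."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def match(fp, text):
    """Cosine-style similarity between a profile and an unknown text."""
    other = fingerprint([text])
    common = set(fp) & set(other)
    dot = sum(fp[w] * other[w] for w in common)
    norm = (math.sqrt(sum(v * v for v in fp.values()))
            * math.sqrt(sum(v * v for v in other.values())))
    return dot / norm if norm else 0.0

# Two made-up writing samples "by the same author"
known = fingerprint(["the exploit is basically trivial",
                     "basically the bug is trivial to exploit"])

# A text in the same style scores higher than an unrelated one
print(match(known, "the bug is basically trivial") >
      match(known, "quarterly revenue grew across all segments"))   # True
```

Real stylometry uses much richer features (function-word ratios, sentence lengths, character n-grams), which is exactly why the result is a graded score rather than a binary same-author-or-not answer, as the talk points out.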
I think I can present that, though you're probably still unhappy with the font size. Okay. Yeah, sure. Yeah, but I'm not aware of any implementation of that up to now. Yeah, I think this guy wrote a book, "Author Unknown", about the Shakespeare thing and how he found the author of an anonymous book about the Clinton election, but to my knowledge it was mainly not a fully automated process; it was just a tool helping him with manual analysis. You could use this program by Dave Aitel to completely automate it, for example on Usenet. And I see I have less than two minutes, so you might just believe me that I fingerprinted some text by Dave Aitel in this .pkl file, and now I run it against a text file which is by the same guy, and I get a match value of 275. If I run it against a file by somebody else, I get a much lower value. And since this is not a digital thing, not a binary "same author or not", this value might even help me to find out whether a writer is from the same ethnic group or something like that, uses the same words and builds sentences the same way. You can Google for this program: just search for Dave Aitel and umask. It's very, very raw and very undocumented, but I think it's very interesting to toy around with. Or buy this book, "Author Unknown". So thank you very much, it was a pleasure.