This session is about metadata. My name is Chema Alonso. He is Jose Palazón, Palako. We are from Spain. I work in a small security company in Spain. I'm also a Microsoft MVP, but I don't work for Microsoft; it's just an award Microsoft gives to some technicians around the world. He works for Yahoo as a security engineer.

Today's session is about metadata. We are going to use metadata and hidden info to recognize the network infrastructure of a company, and to do this we are going to use a tool. This tool is free and you can download it from the Internet just after the talk, or even during the talk. So if you are in a rush, you can download the tool and go away. It's easy.

Well, probably most of you are aware of metadata and metadata security problems, because in 2003 it was a hot topic in the news. In 2003 the Iraq war was about to start and the United States wanted the United Kingdom to be an ally. To obtain this help, the States sent a document to the United Kingdom, a document meant to prove that there were weapons of mass destruction in Iraq. Do you remember this story? Hands up? Yeah. Well, when Tony Blair presented the document to the United Kingdom parliament, the parliament asked Tony Blair if someone had modified this document, and Tony Blair of course said no: no one from my team has been editing, modifying, translating, cutting or anything else with this document. But in the end, the metadata told us that someone in the United Kingdom had been working with this document; more than three people had been working with it. So it was a very big scandal, because they had lied to the parliament. Do you remember that story? Yeah? Well, in the end someone had to resign and it was a very big problem. And that is the end of the story, because after this happened everybody is aware of metadata and nobody publishes a document without cleaning it first.
Or nobody sends an attachment with a PDF or DOC file or XML file without cleaning it first. How many of you have sent a document in the last week without cleaning it first? The rest of you have been cleaning your documents, or you just don't send email from here? Well, let's see the story.

So far, everybody who has been talking about metadata has been talking about the kind of data you can usually see in the properties dialog: who created the document, which company the document is for. But we want to talk about some more information that is there and definitely should not be there when you publish the document. We call some of this hidden information. It's a kind of metadata, but one you cannot actually edit. What is the template that you used when you created the document for the first time? What is your printer? Even if you never printed the document, your printer information is going to be there. What is your database structure? If you do a mail merge, or however it's called, the tables and the columns that you use in the database to create those documents are going to be in this hidden information. And there's also a lot of information that you don't know is there, lost data. You copy and paste something and you are copying something that you don't know you are copying, or the typical text in the same color as the background, stupid things like that. Go ahead.

And we want to treat all three kinds of information together because, at the end of the day, metadata, lost data and hidden information can turn into one another. Metadata can become lost data if the programs don't do things the right way. For example, when you have a DOC document and you export it as a PDF document, some of the metadata is actually going to become content in the PDF document itself, not metadata of the PDF document.
It's going to be attached at the bottom, actually, and you probably won't notice that this information is there. If you publish this document, everybody will be able to see it.

When you have information in your document and you publish the document, search engines are going to index this information. They usually need a title for the result, and some of them show more information, like the file type or the author. Sometimes they infer this information from metadata, but not always: a .txt file, for example, doesn't have a title, but Google is going to give you a title in the result. Sometimes this title is the first line of the .txt file, or something in the footer. If you didn't know what was in that first line or that footer, that can be kind of a problem, and it's going to be indexed.

When you embed one file into another file, you can go to the parent file, go to properties, and see the metadata of that document. But the embedded document also has metadata, which you cannot edit, so this metadata is becoming hidden info. And so on. If you have information that you couldn't edit before, and then a new version of your application lets you modify it, you can say that this hidden information is becoming metadata. At the end of the day, all three kinds of information can turn into one another.

This is an example of how, from a .txt file, you can see the author. Google did this; the author of this file didn't. Google just parsed the document and decided that DTREA is the author of this file. So you can see here how lost information has become metadata for the search engine. This is a very funny one.
There is this publishing company. They have hundreds of books on security topics, and there's quite a cool feature on the website: if you buy one of the books, you go to the second page, where there's a table with codes, and then you go to a form on the website and with a code you can download the PDF, so you get the electronic version. They are selling the books on Amazon, and Amazon has this Look Inside, and Look Inside gives you a random chapter and also the first pages of the book. So you just have to go to Amazon, open Look Inside, look at the second page, go to the site and enter the code, which is actually always the second...

More examples. On the left side you have a document (these two are from Novell) with a link, and the link is actually pointing to a C:\Documents... path, and there is a username in it. It's giving away information about someone who was just downloading the document from home: the username on the box. I really like the one on the right side. This guy was copying some script from one of his servers, and when he pasted it into OpenOffice, OpenOffice recognized the format of a URL and automatically created a link. The guy sees it, sees there is an IP there, removes the IP and writes IP_CLM instead, but he doesn't actually edit the link. So if you mouse over the link, or you actually click it, there is an IP. In this case it is actually an IP that you can access, and you can see more information there. It started with metadata, but following the link you can see that he's using Apache on Windows. Windows and Apache; I think this is part of the Novell-Microsoft agreement, maybe.

The question is: are people aware of this? And the answer is no. Even after what happened with Tony Blair, people are not taking care of metadata, hidden info and lost data. And it's funny, because you can extract a lot of information about the network infrastructure of big companies, software companies and even security companies.
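The OpenOffice link example can be checked mechanically. The following is a minimal sketch, not how FOCA does it: it assumes the file is an OpenDocument text document (a ZIP archive whose body lives in `content.xml`) and lists every hyperlink whose real target does not appear in the text the reader sees. The function names and the in-memory sample document, including the address 192.0.2.10, are made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces defined by the OpenDocument format.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
XLINK_NS = "http://www.w3.org/1999/xlink"


def mismatched_links(odf_file):
    """Return (visible_text, target) pairs where the target is not shown in the text."""
    with zipfile.ZipFile(odf_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    results = []
    for link in root.iter(f"{{{TEXT_NS}}}a"):
        target = link.get(f"{{{XLINK_NS}}}href", "")
        visible = "".join(link.itertext()).strip()
        if target and target not in visible:
            results.append((visible, target))
    return results


def sample_odt():
    """Build a tiny, hypothetical document where the link text was edited but not the link."""
    content = (
        '<office:document-content '
        'xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
        f'xmlns:text="{TEXT_NS}" xmlns:xlink="{XLINK_NS}">'
        '<office:body><office:text><text:p>'
        '<text:a xlink:href="http://192.0.2.10/status">http://IP_CLM/status</text:a>'
        '</text:p></office:text></office:body></office:document-content>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", content)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for visible, target in mismatched_links(sample_odt()):
        print(f"text says {visible!r} but the link points to {target!r}")
```

Ordinary links with descriptive text ("click here") are reported too, so the output is a list to review by hand rather than a verdict.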
So, just as a couple of examples, we are going to look into the FBI files. This is a government organization and they should take care of their documents. If we search Google for Office documents, we obtain more than 4,000 documents. If we are able to extract one piece of metadata, one piece of internal data, from each file, we are going to obtain a lot of information, a lot of data about the internal network, and this can be very bad for this website. So we are going to use FOCA. We are going to use it later in an advanced mode; for now, just a quick introduction. FOCA, friends; friends, FOCA.

We are going to analyze a document from the FBI. It's an Excel document, and it is at this URL. You can download all the documents using FOCA, of course. We are going to analyze this document: just drag and drop, extract metadata, and look at the info that we got. With only one file we can discover a lot of information: two users; a printer which is shared on a network server, and this is an internal server, so we know that this user has access to this server and we can start to build the ACL of this network; we discover an internal domain which is not published; we discover the email account, the operating system which is running on... This is only from one single file, so let's see what happens in the end when we analyze 4,000 documents.

So... is there any lawyer here who can tell us if downloading something from the web and looking inside is illegal in the US? No? In our country this is not a crime. And remember, if something happens, my name is...
Well, remember, my name is Fermin.

Well, this is the Missile Defense Agency. These are the people who take care of the missiles. They are publishing 1,000 documents and we can analyze one of these documents quickly, so let's see what they have. This is another Excel file: drag and drop, extract metadata, and you can discover a user, an internal server, the operating system, the Windows Server and so on, only from one single file. So the question arises, when you are sitting in front of us asking: oh my God, how many files is my company publishing? Are these documents clean? Well, probably not, because nobody is taking care of this. Another big misconception is that only Microsoft Office documents store metadata, and this is not true, because all office file formats store metadata, hidden info and lost data.

So, as we saw with the Tony Blair example, with metadata you can get screwed in many ways, but we want to talk about fingerprinting, network fingerprinting actually, using metadata. This listing here is just a random search for OpenOffice documents. We grepped for printers; we could have grepped for templates or some other stuff inside the documents. But just looking at the printers, let's see how many different things we can do here, because what we are doing here with metadata is actually using the information. The first thing that you can see is IPs. Again, this is downloaded from the web, but the IPs that you have here are the internal IPs of the network. You have here the proof that this doesn't happen only to small companies; it also happens to big companies like Sun and Novell. Then, more stuff that you can see here is the kind of names that sysadmins give to servers: if you see something like SRV2 there, there's probably an SRV1. So again, it started with metadata, and from this data I'm going to use my tools to get some more information. There's another one there with his... How many of you do that in your company?
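The "if there is an SRV2 there is probably an SRV1" reasoning is easy to sketch. This is only an illustration of the idea, assuming the name ends in a number; the function name `sibling_names` and its `width` parameter are hypothetical, and the candidates would still have to be checked (for instance against DNS), which this sketch does not do.

```python
import re


def sibling_names(hostname, width=3):
    """Guess neighbouring host names for a name that ends in a number."""
    match = re.fullmatch(r"(.*?)(\d+)", hostname)
    if match is None:
        return []  # no trailing number, nothing to predict
    prefix, digits = match.groups()
    number = int(digits)
    candidates = []
    for n in range(max(0, number - width), number + width + 1):
        if n != number:
            # Keep the original zero padding, e.g. PRN007 -> PRN006.
            candidates.append(f"{prefix}{n:0{len(digits)}d}")
    return candidates


if __name__ == "__main__":
    print(sibling_names("SRV2", width=2))
```

Names taken from a theme (Mordor, Death Star) have no numeric pattern, so this function returns nothing for them; those need a list of related terms instead.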
SRV1, SRV2, how many of you do that? Some hands? There is another kind of system administrator. Show your hands if you have a Mordor server. Death Star, Luke, Leia, Han Solo, Chewbacca. Star Wars, don't be shy, Star Wars, if you do something like this. No? Okay.

So, actually, for doing this there is a technique called DNS prediction using Google. The guy who presented this paper is Johnny Long, who said that you can use Google Sets, which is a tool that takes a term as input and gives you a list of related terms as output. With a tool like this you can try to get other names: you go to the subdomain of the server that you want and then try other subdomains. We included this functionality in FOCA. This is the Sun Microsystems example: there was a poland.sun.com, and FOCA went through this list and discovered slovenia.sun.com. So you can do stuff like that.

This is another sample of how to start with metadata. You download this from the web and you see that in the metadata there is a username, there's a URL with a server name, there's some more information about the PDF. So you take that piece of information, make another Google search, and you get some funny error messages and that kind of stuff, and you discover another server name. You ping that server name; this was from an American Express example. I mean, you didn't know anything about the organization before, and maybe they are not even hosting the files inside the network, they are hosting them at a third party, but you now know information about the network and you can actually ping it and access this information.

Well, in the end all kinds of files store metadata. You can extract metadata and hidden info from Microsoft Office documents, of course, but also from OpenOffice documents, from PDF, from EPS, from graphic documents by looking into the XMP or EXIF information, and from almost everything. In this example this is a picture with GPS data, but in
a picture with EXIF information you can discover the device with which the picture was taken, the software used on this mobile to create the picture, and even the date when the picture was taken. This is quite funny, because two months ago a girl sent me a picture and I said, "Oh, you look great," and she said, "Thank you," and I said, "How old is this picture?" and she said to me, "Oh, only two months." And the picture had been taken in 2006. It's true, it's true.

Also, when you have pictures embedded in office files, you can access this information. If you are using FOCA, it pulls out all the images and looks for the EXIF information. In this example we are going to use a file from Novell. Novell is our friend. Just drag and drop, extract metadata, look into the picture in the file, and you can access the EXIF information. In this example you can see that this picture is copyrighted by Jonathan's story, because for sure Novell has the license to use this picture in this presentation, I guess.

But you can find metadata also in videos. In this example it's a video in which it's possible to discover the user name or the speaker. Or in a printed TXT. It's quite funny, because almost all the documents printed from Notepad print in the footer the path from which the document was printed. In this case, this is a real example with a girl, and this is the user name. So you can find metadata all around you. You can find metadata on the toilet after you used it this morning. Go ahead, please.

So what can we find in metadata? As I said a couple of times so far, the point of our presentation is to use all this information to do some fingerprinting of the network. We are interested in users: who created the document, who modified the document. As we've seen before, there is also going to be something in the paths that is the user name. We have plenty of information about the users of the internal network in the documents. We have operating systems. We have printers, both local
printers and remote printers. From the paths we get a lot of information about internal servers and NetBIOS servers, and we know the protocols that they are using for printing and for sharing files. We do things like this: if something looks like slash slash, or backslash backslash, name of something, slash, whatever, the tool is going to semantically analyze that path to work out what in it might be a server, and then try stuff, as we said before. Database structures, device information from pictures, all that stuff that we have seen.

How can you extract this metadata? Well, just by looking into it. Most of the time it's strings; sometimes it's binary, and for binary you can just open a hexadecimal editor and see the raw hexadecimal. You have a couple of examples there, like a hex editor or BinText. Or you can use some of the special tools that have been created so far. You can use EXIF readers for images, or libextractor, which is a tool that takes office documents and reads the information there. This is a screenshot of libextractor: it's taking a DOC document and you can see the history of the document, some users, that kind of stuff. You can use Metagoofil too, which is a tool that takes the domain, goes to Google, gets all the documents, runs libextractor and then presents you with that information. This tool was created by another Spanish guy, so it's great.

And of course you don't even need a tool. You can just use Google, because, as Palako said at the beginning of the presentation, Google has its own metadata. So we can discover an FBI user just by searching Google for something like this, looking in the title of Word documents, and you can see the users: Medraniac, Callaway. Your user from the United Nations. Your user from Scotland Yard. Your user from the Carabinieri (remember, not American cars in Italy). And of course, your user from the White House. "No, no, you have to be kidding me." No, no, it's true. Can we get the user from the White House?
Yes, we can, of course.

So, the problem with the tools that existed: they only extracted metadata, and they didn't do it well. They left a lot of metadata there that they didn't show, and they didn't look at all at the other kinds of information that we've been talking about so far. And they didn't do anything with the information. They just say: this is the information, this is what we found, now you do your stuff. And doing that with 4,000 files is a tough task. This is an example of how you can take an Excel file, use the libextractor tool, and get a couple of users, the software that was used, and that's it. With ours you get an internal email, a printer, some more stuff that wasn't there. So we just created a tool that looks at absolutely every piece of metadata that you can find. Another example: libextractor is not very good with XML files either, and the new Office documents are XML, so it finds almost nothing in this file being analyzed, and there is a lot of information with the tool that we created.

This is another very good thing that our tool has. The point is that the tool I mentioned before goes to Google to download office documents, and for that it uses the filetype: operator. The problem is that Google's implementation of this is wrong, because what they are actually looking at is the extension. So if you have a servlet or a PHP script or something that reads the file from disk to serve it, the extension of the script is going to be .do or .php, but the file itself is going to be a PDF. The thing is that Google knows it, because if you look at the screen on the left side, they know it is a PDF file, but with filetype:pdf they won't show you this kind of file. So we decided to use both Bing and Google for doing the search. If you do examples of this you will get... I think we have a screenshot. You have 67... Next. If you see there, just using Google you have 63 documents and you discover nine kinds of software. Same domain, you use Bing, and you
discover ten kinds of software. So you combine both and you have much better results.

So, joining all this stuff, we got FOCA. FOCA is a tool that is going to collect all the files from the Internet, download them automatically, and extract all the metadata, hidden info and lost data, and after doing this you can create, more or less, a map of the network. So let's see this working. We are going to do it with the Missile Defense Agency, for instance. This is quite simple: the project name, the domain and the folder to store the files. Once you do this, you have your DOC files, Office 2007 documents, OpenOffice, PDF and also WordPerfect documents, because it was great. And of course you can customize your search, joining more than one domain; you can type your comments here. And if you are working on a network share, you can drag and drop and all the files will be here. So, extract all metadata, and the tool is going to analyze all these documents. If you are working on an intranet with a document management system such as SharePoint or whatever, this tool could be very dangerous, because those documents store a lot of information and anyone with this tool can access it.

Well, the tool is going to create a list with the users. These are the users of the Missile Defense Agency. You can export all these users to a file, which is good because you can create a dictionary to attack some login page or whatever. The folders. The folders with the internal servers. Software. Emails; you can drop them a line. And of course you can join all this information: just click on analyze metadata, and FOCA is going to cluster all this information and draw the map of the network. More or less; it's not exactly the map, but more or less. You get the operating systems, this is the list of personal computers, and in the end you get the servers, and you can look for other servers. This is the headquarters. That's enough. You can do the same with the FBI, for instance. In
this example we analyzed more than 3,000 documents. And no, it's not a crime. It's public: world-readable documents, public documents. Otherwise, for those of you who are penetration testers, to get to something similar to this you first have to get access to the network, and then maybe find some SNMP server with "public" as the community string to list things like users, printers and all that stuff. And this is just a Google search and a Bing search from home. You get a lot of information. You can see the information about this computer, for instance: the users, and the documents used to infer this computer. And of course you can dig into the documents and find out which software they are using; in this example, eSquad. I suppose they got the license, and so on. Well, you can do the same with Novell, Missiles and a lot of places. Well, that's enough, go back.

There is also an online version, but it's only for one file. At this URL you can download the tool. FOCA is free, so you can download it from this URL and start to play with it today.

How to fix this problem with metadata? Ask your users to clean the documents before publishing them; that is what you have to do. Or get some low-cost engineer to clean the documents, get some staff. There are a couple of solutions for going through your documents one by one and cleaning them. If you're using Microsoft Office 2007 there is this kind of inspector: you go to the properties, run the inspector, and it will tell you: this thing here, this thing here, this thing here, this shouldn't be there if you want to publish the document, so do you want me to remove it?
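What an inspector-style check looks at can be illustrated on the Office 2007 formats, which are ZIP packages of XML parts. This is a minimal sketch under that assumption, not the Document Inspector itself: it only reads a handful of well-known property fields from `docProps/core.xml` and `docProps/app.xml`, and the function names and the in-memory sample package (with users `jdoe` and `asmith`) are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CP = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
DC = "http://purl.org/dc/elements/1.1/"
EP = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"

# (part inside the package, namespace, element name)
FIELDS = [
    ("docProps/core.xml", DC, "creator"),
    ("docProps/core.xml", CP, "lastModifiedBy"),
    ("docProps/app.xml", EP, "Application"),
    ("docProps/app.xml", EP, "Company"),
]


def leftover_properties(ooxml_file):
    """Return {field: value} for identifying properties still present in the package."""
    found = {}
    with zipfile.ZipFile(ooxml_file) as package:
        names = set(package.namelist())
        for part, namespace, tag in FIELDS:
            if part not in names:
                continue
            root = ET.fromstring(package.read(part))
            element = root.find(f"{{{namespace}}}{tag}")
            if element is not None and element.text:
                found[tag] = element.text
    return found


def sample_package():
    """Build a tiny, hypothetical package containing only the two property parts."""
    core = (
        f'<cp:coreProperties xmlns:cp="{CP}" xmlns:dc="{DC}">'
        "<dc:creator>jdoe</dc:creator>"
        "<cp:lastModifiedBy>asmith</cp:lastModifiedBy>"
        "</cp:coreProperties>"
    )
    app = (
        f'<Properties xmlns="{EP}">'
        "<Application>Microsoft Office Word</Application>"
        "</Properties>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("docProps/core.xml", core)
        package.writestr("docProps/app.xml", app)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(leftover_properties(sample_package()))
```

An empty result only means these particular fields are gone; it says nothing about printers, paths or embedded files elsewhere in the package.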
It's better if Microsoft Office 2007 is in Spanish. And then you remove this information. If you are using the 2003 version, you need to have this add-in, which is called Remove Hidden Data, and it's going to do pretty much the same. The problem is that it's not doing it quite well: it is actually removing a lot of information, but it's not removing all the information.

This is a cool example of how, with metadata, you cannot leave anything behind. In this case, in the OLE streams that Microsoft Office is using, if you look at the structure there, there are just two bytes of information with a small number in them. So we created the table on the right to map operating system versions and software versions to those two little numbers. Even if you run the inspector or the plugin, Microsoft Office is going to leave those bytes there, those two little bytes, and then we will just read those numbers and get the information.

So we have a demo for these two. We are going to create a new project, new project, and we are going to use, for instance, this document. This one is a binary Office document. And we are going to... FOCA, FOCA rules. Extract metadata, and we get a lot of information. In this case this document was created with Office XP on Windows XP, and we get two users from Novell. Okay, so now we are going to clean this document with Office 2007. Before doing this we are going to create a copy, and then we are going to use Office 2007. So we inspect the document, and Office 2007 tells us that there is personal information stored in it, hidden info. So we remove all the information, inspect again, and everything is clean. So it's perfect. Now we save the document, close it, and drag and drop it into FOCA. Now in FOCA we analyze this document and, as you can see, it's actually possible to discover that this document has been manipulated with Windows
Vista, which is my operating system, and with Microsoft Office. So it's good but it's not perfect, because you can still recognize the operating system of the user.

The same happens with OpenOffice documents. They include an option to clean the document, and I'm not sure if I should say that they do it right or wrong, because I don't really understand what they do. I think they remove the information that was there, but then they put in the information of the OpenOffice that you are using to remove the other information. So I don't really get it. So, just go ahead, demo of these two. We created this OOMetaExtractor tool, which actually does it right. It is free, it's on CodePlex, you can download it and use it. So we have a demo of how the tool works. It is in Spanish, because in Spanish it's better, but you are welcome to translate it and to port it to something that is not Windows. I'm here.

So now we are going to do the same, but we are going to analyze this document. This document is from Novell. First of all, let's analyze the metadata stored in it with FOCA. Extract metadata, and as you can see this document is a poem, with a lot of information: servers, users, paths, emails. It was created on Win32 using OpenOffice 2.3, and that's all. Now we are going to use OpenOffice to clean all the personal information from this file. To do this we open the document. Open the document; it has macros. Now we go to Tools, Options, Security (it's in Spanish, but I'm translating), Options, and there is an option you can select to remove all personal information when saving. What we are going to do is just accept and save this document with another name. Copy. Okay, so now we are going to analyze this new file with FOCA. This one, with FOCA: drag and drop, the document is here, extract metadata. And it's quite strange, because now the information says that this document has been edited on Win32, and my machine is 64-bit. I don't know why. But it's my OpenOffice
version, and of course this is my printer. So it's a mix between the old information and the new information. In order to fix this, we created MetaExtractor. So, drag and drop the file, now Metadata, delete metadata, and if we analyze this document right now with FOCA we are going to obtain nothing, which is good. So it's quite simple. You really have to give this tool to your users, and they will go through the 5,000 documents and remove everything. Okay, that's all. And you will be fired in one month, more or less.

Well, you cannot trust your users, and if you don't trust your users you can use a special tool. This is a tool for IIS. It is a plugin which deletes all metadata from documents before sending them to the client, while the original remains the same. So it's just there to protect the files on your website.

And there is one more thing to do. After cleaning all the documents, you have to beg Google to delete all the cached documents, because it stores all the documents on its servers. So first of all you have to clean your documents and take them off your website, then go to Google and, using the webmaster tools, say: please, Google, delete this file, because it has my print server, it has my user account, please delete it. And of course, don't trust your users. Try to... because if not, you are going to be working on your system, your platform, it can be Linux or Windows or whatever, and your user is going to pay attention or not. It's a tough task to clean all the documents, but don't complain about your job, because there are of course worse jobs in the world. Working in security is not that bad, so don't complain.

And of course, if any of you is wondering if this presentation has metadata: of course, it is full of metadata. This is the metadata of this PPT. And that's all. Thank you for your attention.
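As a last check before publishing, the path analysis described in the talk can be approximated by scanning a file's raw bytes for UNC paths (backslash backslash server, backslash share) and Windows user-profile paths. This is a rough sketch, not FOCA's method: the patterns are deliberately simple, the function name `scan_bytes` is hypothetical, and the sample bytes (server `PRNSRV01`, user `maria`) are invented. Binary Office files often store strings as UTF-16LE, so the sketch also scans a copy of the data with the NUL bytes removed.

```python
import re

# \\server\share style paths; the server name is captured.
UNC = re.compile(r"\\\\([A-Za-z0-9._-]+)\\[A-Za-z0-9$._ -]+")
# C:\Documents and Settings\<user> or C:\Users\<user>; the user name is captured.
PROFILE = re.compile(
    r"[A-Za-z]:\\(?:Documents and Settings|Users)\\([\w .-]+)", re.IGNORECASE
)


def scan_bytes(data):
    """Return (servers, users) found in raw file bytes, each as a sorted list."""
    servers, users = set(), set()
    # First pass: single-byte text. Second pass: NULs stripped, which turns
    # ASCII text stored as UTF-16LE back into plain ASCII.
    for raw in (data, data.replace(b"\x00", b"")):
        text = raw.decode("latin-1")
        servers.update(m.group(1) for m in UNC.finditer(text))
        users.update(m.group(1).strip() for m in PROFILE.finditer(text))
    return sorted(servers), sorted(users)


if __name__ == "__main__":
    # Invented sample: a printer path stored as single-byte text followed by
    # a footer-style file path stored as UTF-16LE.
    sample = (
        b"\x00\x01junk"
        + rb"\\PRNSRV01\HP LaserJet"
        + b"\x00"
        + r"C:\Documents and Settings\maria\Desktop\notes.txt".encode("utf-16-le")
    )
    print(scan_bytes(sample))
```

Compressed formats (OpenOffice, Office 2007) would need to be unzipped first, and a clean result here does not prove the document is clean; it only catches the most obvious leftovers.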