So my session is called Sneaky PDF. Thanks for your time. A little bit about myself: my name is Mahmud Ab Rahman, and I'm from MyCERT, part of CyberSecurity Malaysia. I'm interested in malicious PDFs as well as Android malware reversing. The agenda for today: I'm going to cover a short intro, then the obfuscation techniques that we have found being used in the wild, then some challenges and issues, and the last part is the conclusion.
All right, introduction. This is not about PDF reader application zero-days, so don't get confused with that. It's also not about an Adobe Reader X sandbox bypass. We're plainly talking about PDF obfuscation, which is pretty much the same idea as, say, Flash obfuscation, or whatever you want, right? So when you have a PDF plus obfuscation, what do you expect it to become? A Transformer, maybe? Not really, right? Of course, you expect it to still be a PDF. So the topic for today is: you have a normal PDF, or a PDF that carries an exploit, and you want to obfuscate it.
Why would you want to obfuscate it? Obviously, there are many reasons. Most malicious PDFs come with two components. The first one is the exploit, and the second one is the payload itself, the shellcode. The main reason, like I mentioned, is that you need to stay off the radar. How? Obviously, you need to look like a normal PDF. You cannot afford to look malformed, because if you trigger that, you raise a red flag for any security product doing detection. What do you want to obfuscate? It depends on what level you want to go to. Normally, people obfuscate just the exploit, or just the shellcode. But really both of these main components need to be obfuscated, right?
While you're obfuscating this, obviously, you want to stay stealthy and reduce the chance of being detected. The other thing is to make analysis harder. If you were here for the previous presentation about APT, we still believe that at the end of the day your tool is going to be detected, right? But you can make analysis much harder by implementing some obfuscation. As for statistics on obfuscated malicious PDFs, I think they look something like this: always going up, because that is pretty much the trend.
And then PDF in general. I hope PDF is taken seriously enough nowadays. I'll try to cover a little bit of the structure and concepts of a PDF. The first part, which you see in the green box, is the PDF version. Normally you have a percent sign, then PDF, a dash, and then the version number, like %PDF-1.1. One of the things that can be used to fool a couple of PDF analyzer tools is to play around with this. Not the version number as such, but the header of the PDF. You need to remember that, according to the PDF reference, it's legitimate to declare the PDF version anywhere within the first 1024 bytes. For example, you can have some weird data first, like percent, percent, percent, or ABC, ABC, or whatever you want, and then you put %PDF-1.1. It's still considered a legitimate PDF header.
The second part is the body, which you can see over here. It's pretty much the main component of the PDF, and it normally holds the objects. An object is normally declared with obj, and it needs to end with endobj.
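The 1024-byte header rule described here can be sketched in a few lines of Python. This is a minimal illustration, not how any particular reader or analyzer does it; the function name `find_pdf_header` and the `window` parameter are made up for the example.

```python
import re

# Hypothetical helper: look for a %PDF-x.y marker anywhere in the first
# `window` bytes instead of requiring it at offset 0.
HEADER_RE = re.compile(rb"%PDF-(\d\.\d)")


def find_pdf_header(data: bytes, window: int = 1024):
    """Return (offset, version) of the first header in the window, or None."""
    match = HEADER_RE.search(data[:window])
    if match is None:
        return None
    return match.start(), match.group(1).decode("ascii")


# Junk before the header, as in the talk's "%%%" / "ABC" example.
sample = b"%%%ABCABC\n%PDF-1.1\n1 0 obj\n<< >>\nendobj\n"
print(find_pdf_header(sample))          # header found after the junk
print(sample.startswith(b"%PDF-"))      # a strict offset-0 check misses it
```

A tool that only checks `startswith(b"%PDF-")` would reject or skip this file, while a reader that follows the lenient rule would open it.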
Later I'll show some obfuscation that you can implement when it comes to the object declaration, like this one. And of course, one of the major components inside here is the JavaScript. In the yellow box over here you have the cross-reference table, which defines the offset of each object. And then you have the trailer. The trailer has one major entry called /Root. /Root gives you the clue, the entry point, where your PDF will start when you open it. The last part is also interesting. You have %%EOF, which marks the end of file for the PDF. You declare %%EOF once, but you can also have multiple %%EOF markers. I will show a demo on this specific topic later on.
So, obfuscation. The obfuscation methods I commonly observe in the wild are, like I mentioned earlier, JavaScript, PDF syntax, PDF features, and the last one is rich media in PDF.
The first one is abusing JavaScript to make your PDF undetectable or more difficult to analyze, and there the sky's the limit. Whatever can be executed by JavaScript can pretty much be run within a PDF engine as well, and the majority of PDF reader applications have a JavaScript engine. You name it: Adobe has it, Nitro has it, you also have Foxit, and even Sumatra PDF as well. It's mainly used for runtime exploit setup: you're not going to put your exploit in plain text within the PDF, so you construct the exploit at runtime before triggering it. Then there's shellcode obfuscation setup: normally, you generate the shellcode on the fly to keep your PDF obfuscated. And in the majority of the PDFs that we saw, the JavaScript is mainly there for the heap spray.
But recently, if you notice, a couple of other spray techniques have been implemented as well. And of course, one of the main reasons is to make analysis harder. Well, what does it look like? If you notice over here, this is a very typical JavaScript operation. You have var a, var b, and then var c, blah, blah, blah. Then you combine all this stuff and you execute it. Another typical JavaScript feature that is used is arguments.callee. And another thing that's normally used is ASCII conversion, where 0x41, for example, represents the letter A. In the PDF itself, you have a dictionary key called /Filter, and /Filter takes a value such as FlateDecode. Instead of writing FlateDecode literally, you can spell it using ASCII codes. For example, take a look at this sample over here: it starts with hex-encoded characters, and so on and so forth.
And of course, we see the usual suspects for encoders, like base64, rot13, and so on. Normally, the JavaScript is written as a one-liner, because JavaScript ends each statement with a semicolon, so new lines really don't matter. So they strip the formatting by putting everything on one line of code. And sometimes you also get spaghetti code. If you take a look at the sample here, this is just a declaration of a function. And the last one is probably try and catch, right? This is pretty much a try block. Oh, OK, sorry, over here. Let me make it bigger; I don't think you guys can see it, right? Can you see it? Over here you can see the try and catch. But really, this is not the execution path, because that started earlier on, before that.
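The name-spelling trick mentioned above (writing a filter name with character codes instead of literally) relies on the `#xx` hex escape that PDF names allow. A minimal Python sketch of undoing it before comparing names follows; the function name `normalize_pdf_name` is an assumption for this example, and the sample name is made up rather than taken from the slide.

```python
import re

# Matches one "#xx" escape inside a PDF name, e.g. "#61" for "a".
NAME_ESCAPE_RE = re.compile(rb"#([0-9A-Fa-f]{2})")


def normalize_pdf_name(name: bytes) -> bytes:
    """Replace every #xx escape in a PDF name with the byte it encodes."""
    return NAME_ESCAPE_RE.sub(lambda m: bytes([int(m.group(1), 16)]), name)


# A hypothetical obfuscated spelling of /FlateDecode.
print(normalize_pdf_name(b"/Fl#61te#44ecode"))
```

Normalizing names first means a signature or parser that looks for `/FlateDecode` or `/JavaScript` still matches the escaped spellings.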
If you notice here, there is an if-else. So this is pretty much very generic JavaScript obfuscation. It's not really obfuscation; it's more like spaghetti. You combine multiple pieces of code and functions together. And sometimes they even come in the shape of a pyramid. If you notice here, it's a bunch of heap spray stuff, laid out like a pyramid. Probably the guy was bored, so he tried to make it look like a pyramid.
Dealing with JavaScript obfuscation can be annoying, because there are many ways people can fool you with JavaScript. Like I said previously, JavaScript obfuscation involves creativity, so if the bad guys are more creative than the analyst, you always end up like this, frustrated. But I believe that with the help of tools you can relax a little bit, because there are tools that help you automate or ease your analysis. For example, to run the JavaScript you need some sort of emulator. If you try to trace it one by one by hand, well, I wish you good luck getting through all the code. For emulators, you have SpiderMonkey, Rhino, or you can even try Google V8. And for the spaghetti code I mentioned earlier, you can use a code beautifier, like JS Beautifier, which plugs into almost any IDE or code editor. You can also use some sort of JavaScript dynamic instrumentation tool to see through the obfuscation and observe the JavaScript at runtime.
Having said that, take a look at the previous JavaScript samples. You can pretty much read all that stuff, right? But in the future it may come looking like this. This is a valid JavaScript implementation. And it may also come written in smiley characters for you guys.
So this is the kind of obfuscation that you can see within the JavaScript in a PDF. The second kind of obfuscation that we saw is about abusing the PDF syntax. For example, notice the /Root over here. As I mentioned earlier, when the PDF starts up, it begins from /Root. And this one specifies that the starting point of this PDF is object 21. So the execution path starts at object 21. It's no longer about 1 0 obj. The mistake we normally make when we start analyzing a PDF is that we begin at the %PDF version and go through the PDF components one by one. But really, you need to start from /Root.
Another thing is referring to another object. For example, notice the variable here. The variable points to something called this.info.title, which is an object-oriented representation. "this" normally refers to the PDF itself, and .info refers to another component, this one, /Info. /Info is a reference to object 5. So this is object 5, and this is your title. So the value of your variable will be this string. That's pretty much the obfuscation that we saw in that PDF.
OK, I talked about the syntax of %PDF and %%EOF. Normally, how do you start parsing your PDF file? If you're building a tool to analyze PDFs, where do you start? Do you look for the %PDF version, or do you look for %%EOF? According to the Adobe reference, you should start from the end of the file. But the problem is, if you choose to read from %%EOF, which one is going to be the first %%EOF?
Because, as I mentioned earlier, you can have multiple %%EOF markers within a single PDF file. For example, this one. If you're parsing your PDF file using %%EOF as the starting point, which %%EOF do you want to use? This one? Or this one? This is a pretty much legitimate declaration. It's not a malformed PDF. The majority of readers, well, I have tested with Nitro, with Adobe, with Foxit, and even Evince, open it nicely. There isn't even a problem with this. So if you build a parser, which one do you choose?
And then suppose you choose the last one. How about this? If you choose the last %%EOF and take a look at object 12, within this %%EOF you have multiple declarations of object 12. You have one here, and you have a second one here. So if you start from here, which data is now represented by object 12? The first object 12 here has data that says "test first EOF". The second one says "test second EOF". I will show you a demo of this later on.
And another one here. The truth is, %%EOF is not even needed. If there is no %%EOF, your PDF reader application can still manage to render the file properly. And for the declarations, if you notice on this one, the last object 12 is always going to be the one that is picked up by your PDF application. So the last one will always be used by your PDF reader.
So, the demo. The first one, let me open it with this. If you open this file, it's freaking slow.
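The two observations above (several %%EOF markers in one file, and the last declaration of an object being the one readers use) can be checked with a short Python sketch. This is a naive byte scan under stated assumptions, not a real PDF parser: it will also match `N G obj` patterns that happen to sit inside stream data, and the helper names `count_eof_markers` and `last_definitions` are hypothetical.

```python
import re

# "12 0 obj" style headers; the lookbehind avoids matching inside "112 0 obj".
OBJ_HEADER_RE = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b")


def count_eof_markers(data: bytes) -> int:
    """Count how many %%EOF markers the file contains."""
    return data.count(b"%%EOF")


def last_definitions(data: bytes) -> dict:
    """Map each object number to the offset of its LAST declaration."""
    latest = {}
    for match in OBJ_HEADER_RE.finditer(data):
        latest[int(match.group(1))] = match.start()  # later ones overwrite
    return latest


sample = (
    b"%PDF-1.1\n"
    b"12 0 obj\n(test first EOF)\nendobj\n%%EOF\n"
    b"12 0 obj\n(test second EOF)\nendobj\n%%EOF\n"
)
offset = last_definitions(sample)[12]
print(count_eof_markers(sample))
print(sample[offset:offset + 27])
```

A tool that stops at the first %%EOF, or keeps the first declaration of object 12, would analyze different data from what the reader actually renders.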
For example, this one: this is your first %%EOF. But you can put a bunch of other things below it as well. Like this is object 12, and you end with another %%EOF. And this is interesting: you have object 12 again, and inside that object 12 you have another %%EOF. That is just data that belongs to object 12; it's not actually the second %%EOF. And then you have the third %%EOF. So if you open this file (it's always slow when it comes to the demo), what you get is the second one, right? Because the last one is always the one that will be used by the PDF reader.
Now let me open this other one. This one doesn't even have the %%EOF. But if you open it with any PDF reader, everything is just fine. Below this there is no longer any %%EOF, but you have object 12 here, and you also have object 12 here. And if you just open it, it still opens. So it's very tricky to find where you should stop. When you're building a parser, it's very tricky to decide where your starting point is and to parse every single object within the PDF.
In the future, I believe malicious PDFs will have something like this: references between objects that create a loop. For example, you have object 1 with a dictionary entry that points to object 6 over here, and object 6 has an entry that points back to object 1. If you open it with any PDF application, it may stop. But if you're building your own parser, when are you going to stop your loop? That would be interesting to see.
Another thing about abusing the PDF syntax is related to the Acrobat JavaScript API, which is specifically designed for the PDF engine, especially Adobe's. There are a couple of JavaScript functions, like getAnnots, syncAnnotScan, getPageNumWords, and getPageNthWord. getPageNthWord, for example, gets a word on a specific page. And getPageNthWord(2, 1).
It basically gets the third word of page 1, for example. And you can select a word on a given page as well. This is normally used together with obfuscation in the PDF. For example, this is one of the functions. You can see this one, getPageNumWords, and this selectPageNthWord. It's very difficult to emulate all of the Acrobat JS engine in your own parser, because this kind of JavaScript code is not supported by default by public JavaScript engines. SpiderMonkey and V8 were not designed for this.
Another thing, like I mentioned earlier, is that you can abuse incomplete syntax. For example, an object and its stream. This is object 1. We start with 1 0 obj, so you expect it to end with endobj. But if you don't have an endobj, that's no problem. Your PDF application will load it just fine. But if you build your own parser, you need to know where to stop. The same goes for a stream: you expect to have endstream, but if you don't have it, it's also fine. And even funnier, you have stream, then another stream, then an endstream, and another endstream here. So which data belongs to this stream? This one? Or this one? Or this one? It's very tricky to analyze all of it.
So, like I mentioned, the problem is that these things are difficult to automate within a parser: emulating Acrobat JS code, knowing when to stop a loop, and finding the ending tags for endobj and endstream. And of course, you need to understand how the PDF reader application's parser actually works. But then, are we building another PDF reader application? I think not, right? We are building a parser to analyze malicious PDFs. I put it here: we organized a challenge for the Honeynet Project last year.
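The "when do you stop the loop" problem for objects that reference each other can be handled with a visited set. The Python sketch below assumes the object bodies have already been extracted into a dictionary keyed by object number; the function name `walk_references` and the `/Next` and `/Prev` keys in the sample are hypothetical, chosen only to build a two-object cycle like the one described in the talk.

```python
import re

# Indirect references look like "6 0 R".
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def walk_references(objects: dict, start: int) -> list:
    """Visit every object reachable from `start` exactly once, in order."""
    seen = set()
    order = []
    stack = [start]
    while stack:
        num = stack.pop()
        if num in seen or num not in objects:
            continue  # already visited, or a dangling reference
        seen.add(num)
        order.append(num)
        refs = [int(m.group(1)) for m in REF_RE.finditer(objects[num])]
        stack.extend(reversed(refs))  # keep left-to-right visiting order
    return order


# Object 1 points to object 6, and object 6 points back to object 1.
looped = {1: b"<< /Next 6 0 R >>", 6: b"<< /Prev 1 0 R >>"}
print(walk_references(looped, 1))
```

The walk terminates because each object number is expanded at most once, no matter how the references are wired.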
A couple of these tricks were implemented there, if you're interested in seeing them and trying the challenge. It seems very difficult to automate this analysis, so manual tools are recommended. A manual tool is one that doesn't really parse the whole thing. It just grabs any object and displays it, so you can manually trace the components that belong to the object. You can follow the PDF flow, and you can manually inspect each object or each stream so you know where it should end. And dealing with the Acrobat JS is a pain in the ass. A couple of tools that you can play around with are PDFStreamDumper, PDFMiner, PDF Dissector, and FileInsight.
Another thing that you can abuse within the PDF application is the filter. The category I put this under is PDF features, and one of the normal features in PDF is the filter. The main reason for a filter is to compress the size of the PDF, so you normally use some encoding to compress it. If you use lesser-known filters, for example CCITTFaxDecode or DCTDecode, there's a high chance that your malicious PDF will not be detected. The general filters you normally see are FlateDecode, ASCIIHexDecode, ASCII85Decode, and JBIG2Decode.
This is also interesting: if you look at the filter dictionary, it can accept a couple of parameters. You can make your filter depend on strict parameters, so that if you try to decompress without passing those parameters, the decompression will fail. You implement this with the key called /DecodeParms. For example, this code uses CCITTFaxDecode. It's normally for images, but it doesn't really matter. The data doesn't have to be an image. You can put whatever you want in there, as long as it's in the CCITTFaxDecode format.
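For the general filters named here, the encoding can be undone with Python's standard library. The sketch below is a simplified illustration, not a complete filter implementation: it covers only FlateDecode, ASCIIHexDecode and ASCII85Decode, ignores /DecodeParms entirely, and does not attempt the image filters such as CCITTFaxDecode or DCTDecode. The names `decode_stream` and `DECODERS` are assumptions for this example. A /Filter entry may be a list, in which case the filters are applied in the order given.

```python
import base64
import binascii
import zlib


def _ascii_hex(data: bytes) -> bytes:
    digits = b"".join(data.split())       # whitespace is ignored
    digits = digits.split(b">", 1)[0]     # ">" marks end of data
    if len(digits) % 2:
        digits += b"0"                    # odd length: pad with a zero digit
    return binascii.unhexlify(digits)


def _ascii85(data: bytes) -> bytes:
    body = data.strip()
    if body.startswith(b"<~"):
        body = body[2:]
    body = body.split(b"~>", 1)[0]        # "~>" marks end of data
    return base64.a85decode(body)


DECODERS = {
    "FlateDecode": zlib.decompress,
    "ASCIIHexDecode": _ascii_hex,
    "ASCII85Decode": _ascii85,
}


def decode_stream(data: bytes, filters: list) -> bytes:
    """Apply each named filter in order and return the decoded bytes."""
    for name in filters:
        decoder = DECODERS.get(name)
        if decoder is None:
            raise ValueError("unsupported filter: " + name)
        data = decoder(data)
    return data


payload = b"app.alert(1)"
encoded = binascii.hexlify(zlib.compress(payload)) + b">"
print(decode_stream(encoded, ["ASCIIHexDecode", "FlateDecode"]))
```

Raising on an unknown filter, rather than silently skipping it, makes it obvious when a sample uses a filter the tool cannot handle.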
You can also abuse the filter by implementing multiple layers of filters. For example, you combine ASCII85Decode with LZWDecode. There is no limit to how much you can play around with this, so you can stack multiple filters. For example, this one applies FlateDecode and then ASCIIHexDecode later on. You can also use abbreviations. For example, A85 stands for ASCII85Decode, and LZW stands for LZWDecode. If you don't handle these abbreviations in your parser, it will be difficult for you to decompress all these filters.
Another interesting part is abusing /Encrypt. Encryption is something you use when you have a secret PDF and you want to put password protection on it. When you do that, your strings and your streams are going to be encrypted. Adobe implements the RC4 and AES algorithms. When I mention encryption here, it's not really about protecting the document itself. It's rather about protecting the strings. For example, alongside /Root you have the /Encrypt entry, which points to object 76 over here. If you look at object 76, it declares that the content is encrypted with AES.
Now, with this one, normally when you open the PDF, it will prompt for a password. Well, we have a sample where, depending on where /Root starts, while you're busy keying in the password it has already triggered the exploit. Of course, this depends on the event, the /AuthEvent. For example, you have /DocOpen, so it triggers when you open the PDF. But /Root points to a different object that has the exploit in it, so that is triggered first.
And only then does it ask for the password from this event. It's pretty difficult to analyze, simply because you cannot read the strings; you need to break the encryption first. There is no plain text.
So, a quick demo. I'll open this with FileInsight first, to show a file that has encryption. Like I mentioned, please go to your /Root first. /Root is normally within the trailer. Here you have the definition for /Encrypt, which points to object 29. But your /Root here points to object 10 first, and /Encrypt points to object 29, I think. 29. If you go to object 29, this is pretty much the declaration of the RC4 algorithm for the encryption. And we have some streams over here. For example, take a look here. This stream uses the FlateDecode filter, so you'd expect that decoding it with zlib should work. You select this one, you right click, and you go to decode and try to inflate. Of course, you cannot inflate it, right? Because it has been encrypted. And if you run this using a PDF application, the exploit might trigger. But you still couldn't find out what kind of exploit I used for this one, right? It takes a while for the exploit to work because I'm running a heap spray on top of it. I think we'll come back to this screen after a couple of slides.
The fourth one that we normally see is abusing the rich media support in the PDF reader, because they support many media types. You can embed a couple of media types, for example AIFF, .mov, .mp3, and even SWF. So if you put an exploit against one of these media players within the PDF file, well, you normally win.
Because people are less suspicious about clicking on a PDF than about clicking on an .mp3 or .mov that comes in over email. And normally with PDF, you are going after Flash today. So you have a normal, good-looking PDF. It's not malformed. But if there are bugs in Adobe's Flash, you get owned as well.
What does it look like within your PDF structure? You start with /Root again. You check that /Root points to object 31. Object 31 has the pages pointing to object 1. So you go to object 1, and there you have the page. The page has a reference to object 12. You go to object 12, and object 12 has a Flash file in it. You see, CWS is basically a compressed file, and this other one, FWS, is an uncompressed file. The PDF is pretty much just the transporter for this. You need to extract the Flash file to check what the exploit is and what kind of shellcode is inside it. The tools you can normally use for this are PDFStreamDumper plus PDF Examiner.
And then, of course, you can do Flash analysis later on to trace what really happens when you open the PDF. As we know, the PDF reader has an engine for Flash. For the Flash analysis, the ideal is to decompile the Flash bytecode back to AS2 or AS3 code. That's the ideal, but so far I've seen only two pieces of software that do proper decompiling, and they are commercial. You can integrate this kind of tool within your parser: Sothink SWF Decompiler, for one. And if you look at PDFStreamDumper, they have another proprietary tool as well, called AS3 Sorcerer. This is what the decompiled code looks like in AS3. It's pretty much AS3 code here. And if you cannot decompile, you can always do some bytecode analysis here.
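Once the embedded Flash stream has been extracted from the PDF, the CWS versus FWS distinction can be handled with a few lines of Python. The sketch below relies on the SWF header layout (3-byte signature, 1-byte version, 4-byte little-endian length, with everything after the first 8 bytes zlib-compressed in a CWS file). The function name `unpack_swf` is an assumption for this example, and other signatures are deliberately rejected.

```python
import struct
import zlib


def unpack_swf(data: bytes):
    """Return (version, declared_length, uncompressed_swf_bytes)."""
    signature = data[:3]
    if len(data) < 8 or signature not in (b"FWS", b"CWS"):
        raise ValueError("not an FWS/CWS Flash file")
    version = data[3]
    declared_length = struct.unpack("<I", data[4:8])[0]
    body = data[8:]
    if signature == b"CWS":
        body = zlib.decompress(body)  # only the part after the header is packed
    # Rebuild an FWS file so that other tools can read it directly.
    return version, declared_length, b"FWS" + data[3:8] + body


# Build a tiny fake CWS file to show the round trip (the body is dummy data).
fake_body = b"\x00" * 32
fws = b"FWS" + bytes([9]) + struct.pack("<I", 8 + len(fake_body)) + fake_body
cws = b"CWS" + fws[3:8] + zlib.compress(fake_body)
version, length, unpacked = unpack_swf(cws)
print(version, length, unpacked == fws)
```

The uncompressed result is what a decompiler or bytecode disassembler would then be pointed at.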
For example, this one is constructing a NOP sled for the shellcode. You can see over here that it pushes some stuff, pushes integers. There are a couple of loops over here to fill the array. And this part is pushing some ROP gadgets on top of the shellcode. And here you can also put some shellcode and try to transform it and encrypt it. So in the end you just have a, not really encrypted, but properly obfuscated Flash file within your normal PDF.
Let's see this one. When you open the PDF, as the demo just now showed, it just gives you a calculator and then it crashes. But then again, we don't even know what exploit was used for this one. This is not the main focus of this presentation. I just wanted to show how you can analyze a Flash exploit that is embedded within a PDF.
So, the challenges and issues. I would like to share them with you guys, and I hope there is a proper answer for a couple of these challenges. I would say building a complete PDF parser is a very tough task. Having lexical analysis is one thing, but parsing the not-so-standard objects is another. And embedded components also need attention, like fonts. How many exploits that leverage fonts have been used? Recently, you know, on iOS, JailbreakMe 3 also started from a font. And of course media, Flash, audio and so on, is also a main problem to address within a PDF parser or PDF analyzer. Building a fully compatible JavaScript emulator is very painful, especially for the specific Acrobat JS functions. And inspecting the whole potential execution path within the context of the PDF reader application is hard. If you try to analyze outside the context of the PDF reader, it's difficult to work out the flow.
For example, you check app.viewerVersion, and for viewer version 9 you have a different execution path. So you need to mimic all of this environment within your analyzer.
Even though there are these challenges and issues, I think a lot of effort has gone into building tools. You have PDFMiner, or you have Origami, which can be used as a library in Ruby. And of course, you have the famous Didier Stevens tools. Then PDFStreamDumper, and PDF Dissector, which is commercial, I think. There are also a couple of online PDF analyzers. The one that we built is called Gallus. Then I think we have the famed jsunpack and PDF Examiner. And the last one, also built by our previous presenters, is a PDF analyzer called APT Deezer. Still, none of these applications is 100% compatible with how the PDF reader application works. So it's very tricky.
The conclusion that I can draw from my presentation today is that there are many obfuscation methods to hide a malicious PDF. You can use JavaScript, you can use PDF syntax, you can abuse some features. And the features lead to many problems too. You have /Filter, which gives you the freedom to do some encoding. You have something like /Launch, to launch whatever command you want. It's nice, it's convenient, but it can also lead to different problems. And you have /Encrypt, which I just demoed. Again, encryption is a major stumbling block. It's good for protecting your secret data, but it also does the same for zero-day protection: you can protect a zero-day with this encrypt function. More complex exploitation techniques will make analysis more difficult, I think. For example, comex's PDF for iOS that leads to kernel exploitation. Of course, there is a kernel exploit as well in this one, right? And of course, I don't know.
Maybe embedding a malicious PDF scanner or detector engine within the reader application itself is a good idea, because then you have the context of the PDF reader, OK? But for some reasons, I think it is also a bad idea. If you add another engine on top of the reader, you add complexity to the code base, which is bound to introduce bugs within the engine itself. It would be embarrassing to have code that is meant to scan for malicious stuff turn out to be vulnerable itself, all right?
So that's pretty much the conclusion. If you're interested in this PDF weirdness, you can read the PDF reference from Adobe, this one. And of course, there is a separate reference as well. This one is specifically for the JavaScript scripting engine, so you can read that too. A couple of researchers have been working on this, like Sebastian Porst, and Julia Wolf has also been working on this as well. Thanks to all the people who have been working in this field. And I think that is all from my presentation for today.