So, let's see how loud you applaud at the end of these 20 minutes. It might be a bit different. Good evening everyone, my name is Pierre-Marc. I'm a malware researcher with the antivirus company ESET. Our company is based in Europe. We have a big office in San Diego and a fast-growing office in Montreal. I'm going to present today on basic Python for malware analysis. Please focus on the "basic" part of things. I am far from having the skills that you do. It doesn't work anyway. Can you guys hear me okay?

So, yes, this is very, very basic Python for malware analysis, and I apologize for my lack of skills. I just hope that you guys can give me a couple of tips on how to improve what we are doing with Python.

On a daily basis, the duty of our team in Montreal is to perform malware analysis. What we want to do is just to understand some malware, understand what it's doing. In this process, we are using Python in different ways for different purposes. The first one of these is to process huge amounts of files. On a daily basis, we receive between 200,000 and 300,000 new files that we've never seen before. Most of them are either HTML or PE files.

Already a question. What is malware? All right, malware stands for malicious software, and it's something that is bad for your computer. Most of it is now targeting Windows systems, but we are also seeing malware for other platforms such as Linux, OS X, and all these types of things. Most of what I will be talking about today is focused on Win32 binaries, but a lot of the tools can be used for other platforms as well.

So, I apologize. This presentation is quite short, and I won't have time to dig deeply into what malware is, what reverse engineering is, or the tools we are using. I'm going to try to focus on Python. But if you have more questions, please feel free to see me after the presentation. I will be glad to answer any of your questions.
So, we are using Python for batch processing, processing large amounts of files, but we are also using it when debugging Windows files, for example to defeat packers and obfuscation. And finally, we also use it during our static analysis phase, where we want to understand what some piece of code is doing.

When we are seeing some malware, the first thing we want to answer is this list of questions. So, we are presented with a file. It's usually a Win32 PE file, and I want to go through this checklist as fast as possible, keeping the repetitive steps to as few as possible.

So, when we have a new malware, we want to understand how it gets installed in the system. What's the infection vector? How did the system get compromised? Then we want to understand how the malware will persist in the system, how it will, for example, change a registry key to make sure that it starts again when the system reboots. We want to find out if there is any stealth functionality inside the malware. Does this thing install any rootkit? Does it try to hide its files on the hard drive? We are interested in all these types of features because they will be important for doing forensic analysis or for detection purposes.

But I want to stress that the team we have in Montreal is not focusing on detection. We are really focusing on malware analysis. We don't want to detect the stuff. We want to understand what it is doing.

We want to understand how a malware will communicate on the network, what type of communication it will have. Most malware nowadays will communicate with a command-and-control server, sometimes with encryption. So we want to be able to analyze network traffic and understand what is going on. And finally, we want to find out what the payload of the malware is. What was it created for? Is it created to steal information from an infected system? Or is it just created to send spam?
A new trend in malware is also pay-per-install, where some guys will simply infect computers in order to install more stuff from other people who pay them for this. So this is all I will say about malware, and if you want more information, just let me know afterwards.

So the first module I use in Python for batch processing is pefile. It was written by Ero Carrera. It's a great module. It allows you to parse a PE file and access all the different fields that are located inside the header. I use pefile to validate PE files because I can see, for example, all the different sections of the PE file and check that the size of these sections, when you add them up, is lower than or equal to the size of the whole file. If this size is bigger than the file you have, you might be dealing with a corrupted file. A lot of our customers will send us files that are corrupted or truncated. You only have half of the file, so the Windows loader will not load it in memory and it's not really malicious. So if I get 100,000 files, I want to go through them using pefile and just keep the ones that are well-formed.

And another one here. Right. So PE stands for Portable Executable. This is the standard executable format for Windows. It's also used for dynamically loaded libraries, DLLs. So it's a very standard file format that is being used under Windows. I'm sorry for you guys who don't like Windows, but this is the bulk of what we deal with. It's like a DLL. Exactly. Or Mach-O under OS X. So basically this library will allow us to parse the PE structure and validate it.

So this little code snippet shows that I can try to match some signatures inside these files, using PEiD. Who here has heard about PEiD? Oh yeah, so we have a couple of CISSP groupies here that know about Windows and reversing. All right, so that tells me that I can go a bit faster for this one.
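
Before moving on to PEiD, the truncated-file check described a moment ago can be sketched roughly as follows. This is not the speaker's script: to stay self-contained it reads the PE headers with the standard library `struct` module instead of pefile, and it checks that each section's raw data ends inside the file rather than summing the sizes. With pefile, the same two numbers are exposed on each entry of `pe.sections` as `PointerToRawData` and `SizeOfRawData`. The helper `make_minimal_pe` is a hypothetical stand-in that builds a tiny one-section image for demonstration only.

```python
import struct


def sections_fit_in_file(data: bytes) -> bool:
    """Return True if the PE headers parse and every section's raw data lies inside the file."""
    # DOS header: "MZ" magic, with the PE header offset (e_lfanew) stored at 0x3C.
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    # 4-byte "PE\0\0" signature followed by the 20-byte COFF file header.
    if pe_offset + 24 > len(data) or data[pe_offset:pe_offset + 4] != b"PE\0\0":
        return False
    (num_sections,) = struct.unpack_from("<H", data, pe_offset + 6)
    (opt_header_size,) = struct.unpack_from("<H", data, pe_offset + 20)
    table = pe_offset + 24 + opt_header_size
    for i in range(num_sections):
        entry = table + 40 * i  # each section header is 40 bytes
        if entry + 40 > len(data):
            return False
        # SizeOfRawData at +16, PointerToRawData at +20 in the section header.
        raw_size, raw_ptr = struct.unpack_from("<II", data, entry + 16)
        if raw_ptr + raw_size > len(data):
            return False  # section data runs past the end: truncated or corrupted
    return True


def make_minimal_pe(raw_size: int = 16) -> bytes:
    """Build a tiny, hypothetical one-section PE image (demonstration only)."""
    dos = b"MZ" + b"\0" * 0x3A + struct.pack("<I", 0x40)
    coff = struct.pack("<HHIIIHH", 0x14C, 1, 0, 0, 0, 0, 0)  # 1 section, no optional header
    raw_ptr = 0x40 + 4 + 20 + 40  # section data starts right after the section table
    section = struct.pack("<8sIIIIIIHHI", b".text", raw_size, 0x1000,
                          raw_size, raw_ptr, 0, 0, 0, 0, 0)
    return dos + b"PE\0\0" + coff + section + b"\x90" * raw_size


sample = make_minimal_pe()
print(sections_fit_in_file(sample))       # True: complete file
print(sections_fit_in_file(sample[:-8]))  # False: last 8 bytes cut off
```

In a batch run, a function like this would simply be applied to every incoming file, keeping only those that return `True`.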
But PEiD is a very common tool that is used by many reverse engineers to identify a packer. I will go into a bit more detail about packers later. But using pefile, you can load a signature file from it and match it against the different files you are dealing with, to know if a file is packed and what packer it uses.

If you don't know what a packer is, you can try to see it as Russian dolls. A packer is a protection layer that you apply to an executable. This is used very often to protect an executable, to make sure that people like Gabriel won't reverse it and understand what's going on. Well, malware authors often use packers because they don't want us to know what's going on. They don't want us to analyze the file. So the first step you have to do if you want to analyze some malicious file is to go through these Russian dolls, these layers, and reach the core, which is the small black thing here. You want to go through all the protection layers and reach the core, which is the original executable that will let you understand how the file works and what the program does, because a malware is just another program you want to analyze. And once you have access to this, then you have more chances of understanding what's going on.

The packers will obfuscate code and they will compress it, but they also include lots of different tricks that make analysis harder. For example, they will try to detect if the file is running inside a debugger. So at least a couple of guys here should know what a debugger is. I think it's okay. A debugger I often use is called Immunity Debugger. It was released by a company called Immunity, and it's just an enhanced debugger with a Python interpreter in it. So you can use Python to automate lots of the tasks that you are doing. The stuff that I do all the time, I want to make sure that I don't do it by hand, because those steps are simple. They are easy to understand and you don't need to spend time on them.
I want to focus on the part that is hard. So, for example, many malware will try to check if the file is being run under a debugger, and the way it does that is by checking the Process Environment Block (PEB), which is just a memory structure used in Windows. One of the first fields in the PEB is a flag. If the flag is one, it tells you that a debugger is present, and if it is zero, it tells you that no debugger is present. So the malware, before starting, checks whether it is being debugged, and if so it will stop, or it will throw you into some crazy code paths that don't make sense.

So there is a script that I did not write, but that was written by the guys at Immunity, called hidedebug, in Python, that will change different parts of memory to make the executable believe it is not being debugged. This snippet just shows that one of the first things it does is to alter the memory of the Process Environment Block: it takes the flag, which might be at one because the process is being debugged, and changes it to zero. What you have to remember from this is simply that this tool exists, it's called hidedebug, it lives inside Immunity Debugger, and it helps you automate lots of these basic steps that you would take any time you want to start debugging something. Instead of having to click a couple of times, you just run this script when you start your program and it will automate the basic steps for you.

You can also use Python to help you understand code. When we receive a malware, it's an executable and it's all assembly. We have to understand what's going on, and of course these guys don't usually ship with comments and strings and these types of things, so we have to figure it out by ourselves. The first example I have is a quick Python script that I wrote to help myself understand a piece of malicious software called Swizzor.
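
As an aside on the debugger check described above, here is a minimal sketch of what the check and the hiding trick amount to. It does not use Immunity Debugger's API and does not touch a real process: a `bytearray` stands in for the start of the PEB, and the helper names are hypothetical. The one concrete fact it relies on is that the `BeingDebugged` byte sits at offset 2 of the PEB.

```python
# Offset of the BeingDebugged byte inside the Windows PEB structure.
BEING_DEBUGGED_OFFSET = 2


def make_fake_peb(debugged: bool) -> bytearray:
    """Hypothetical stand-in for the first bytes of a process's PEB."""
    peb = bytearray(16)
    peb[BEING_DEBUGGED_OFFSET] = 1 if debugged else 0
    return peb


def is_being_debugged(peb: bytearray) -> bool:
    """What the malware's check amounts to: read the flag."""
    return peb[BEING_DEBUGGED_OFFSET] != 0


def hide_debugger(peb: bytearray) -> None:
    """What the hiding script amounts to for this flag: overwrite it with zero."""
    peb[BEING_DEBUGGED_OFFSET] = 0


peb = make_fake_peb(debugged=True)
print(is_being_debugged(peb))  # True
hide_debugger(peb)
print(is_being_debugged(peb))  # False
```

A real script does the same one-byte write through the debugger's memory-access functions, and patches several other debugger indicators as well.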
Swizzor, once you are able to get through the packer and inside the binary, has a routine that obfuscates its strings. It's using a simple XOR algorithm where it will XOR all the characters inside one string. The strings in the binary are not all decrypted at once. They are only decrypted right before they are used, and once a string has been used in the code they just remove it from memory, so you cannot just let the software run to some point and have the whole thing decrypted in memory.

So I use Python. Well, the first step is to understand this listing of assembly, where you have this small XOR routine, and reprogram it in Python so that you can run it inside IDA and it will give you all the strings that are located inside the binary. You can do it by hand, but it's quite long and tedious, and it's useless once you have understood the thing once. You just write a couple of lines of very bad Python and you save some time.

So in this case, in IDAPython, you have a complete interface that lets you play with memory, play with the bytes of the program and modify them. What my script does is go through all the segments of the PE file. Some structure in memory can be an instruction, but it can also be data, so what I do is check that it's not code, because I want to decrypt strings, and then I validate that it looks like a string, meaning that it is a list of characters. Once I have validated that this is a list of characters, I call my string decode routine, which is a bit lower here.

I will share these slides with you guys, because I know there's a bit of code and it's probably not the best way for you to understand it here, but I just want you to get the general idea of how we can automate some of the processes for malware analysis. So the decoding routine is exactly the opposite of what was implemented inside the malware: we just XOR all the characters inside the string with the key that comes hard-coded in the binary. These binaries
have to be standalone; they come with everything. So most of the time, even if they have crypto, it's quite weak, because they need to embed the key with it, and you can find the key. So what this code will do is de-XOR the string and, at the end, patch the bytes inside the database with the clean version, and also make some comments so that it gets easier for you to read. If you are not familiar with IDA, it's a disassembler that will just show you assembly listings, and you can add comments. So when you have a comment on a line it looks like this, and when you have a call to a function, you have cross-references.

The second example I have here is for Peerfrag, which is another family of malware. They also have some tricks that make it harder to understand. All the strings are also encoded, but they are stored in an array of strings which is located in a separate area of memory. Instead of referring directly to these strings, they take a pointer to the beginning of the string array, add the offset of the string they want to use, and then use it in the program. That means that the first thing you are presented with is the left part of this screen, without the comments. So I wrote a quick script that would help us, or help me, identify places where a string is used and then add a quick comment and a cross-reference, so that I can understand what's going on and have a bit more information that will help me understand the malware.

You have to remember my second slide: the only thing I want to do here is understand what this thing does and what kind of things it might be doing. So for example here, when we look at the code, we find out that they are using a "drive infected" string, and it's actually some kind of message inside the binary that says, okay, I just infected a thumb drive. So the script that I did for this part is quite similar to the first one you've seen for Swizzor, but in this case, instead of looking for strings that are used inside the binary, I'm looking
for this series of instructions that take the pointer to the beginning of the array of strings and then add something. So the code here is: look for the instruction in the assembly; if it's a push, all right, then look at the previous one; if it's an add, that means that you were indexing into the array of strings. Then you use this reference, and you make a comment and add a cross-reference to it.

So I see all these dead fish eyes, so I will move away from the assembly, but please feel free to come and see me if you have more questions about this. Yes, please.

One question. I know that your examples are specific to Windows, but what kind of tools do you have to explore these things on other platforms?

So for example IDA is not free software, but it is available for OS X, Linux and Windows, and the assembly is actually the same because it's x86 assembly. So these things were found on Windows systems, but they could be found on other systems as well. It's just that at this point in time there is a lot more malware, and it is a bit more advanced, on the Windows platform, because up to now the majority of people are using Windows, and this is why malware is targeting it. But I mean, two weeks ago I was analyzing some malware for OS X, and it's the same process and it's the same thing we want to do. We want to automate everything we find out, to make sure that we don't repeat ourselves.

Another question: as soon as you see a packer, isn't that an indication that something fishy is going on?

I would have thought so, but actually it's interesting to see that the most advanced packers or software protections are not used by malware. Does anybody know where they are used, in what industries?
Games. Games: Blizzard doesn't want you to crack their games. So many, many software products are using software protection. I was surprised to see that even Google at one point was shipping applications that were packed with some weird packers that were not really custom. So we cannot assume that if something is packed it is malicious, although it happens.

I have four more slides to do. Another thing we want to do, once we have reverse engineered some malware, once we have understood how it communicates, for example by looking at how it interacts and what it's trying to do, is write some quick Python script. This one is using dpkt to extract packets from network captures, pcaps, and find all the different data streams that are using HTTP, or TCP port 80. And then, if you are able to understand the malware, you can find out that it's using some interesting crypto. In this case, which is Kelihos, it's compressing its data using zlib and then encrypting it using Blowfish. So once again, having Python with all these libraries like PyCrypto is very useful, because I don't have to re-implement all these algorithms. I just use Python, and within half an hour I have a script that allows me to decrypt network communication. So thank you guys, Python is cool.

There are many other things that I cannot fit into this presentation. Two of them are tools to automate WinDbg, which is another debugger for Windows that is quite popular, probably the most powerful, but maybe too powerful for me because I'm not really an expert at it. PyDbgExt and pykd are two extensions for WinDbg in Python. And then there's also PyHiew, which is a Python interpreter that you can use with Hiew, which is a very popular hex editor.

To conclude this very short presentation: malware reverse engineering, or any reverse engineering task, is often very repetitive. You are dealing with small assembly listings, you are dealing with lots of stuff you need to understand, so automating it is
quite important, and this is where Python is useful. And of course, remember who we are working for, so if you don't want to work in Bratislava, Košice or Kraków, come to Montreal. Does anybody have suggestions?

I think our official title is researcher. Didn't they put "senior" in front of your name? Oh yeah. Any more questions?

How much of your work is actually done in batch?

Right. Well, to be very honest, most of my work is now emails. Our team is really focusing on in-depth analysis. So we have lots of automated systems, we are collecting lots of files, and we are not going through all of them. When they reach our team, we go by hand. That means that our automated tools failed and now we need to understand what is going on, so we at least need a step of manual processing. But inside the rest of our company there is a lot of automation, and we need it.

Yeah, I guess it's like some sort of triage.

Exactly. So there is a lot of triage going on, and once we identify something that is either completely new or something that is really breaking our tools, well, we will have to go through many of them.

Just a quick note: there are packers like UPX which are just simple executable compressors that can actually speed up the execution time of your application, at least for loading. There are many excellent legitimate uses of those types of packers.