Good morning. Today I'll talk about reverse engineering binaries. Before we start, any ideas on why, or in what scenarios, we need to reverse engineer binary files? Anyone? Why would we want to reverse engineer our binary files? To see what a program is trying to do. These kinds of things we do all the time. Yes, analyzing malware. Sometimes we also have a situation where something is not working right. We debug, like you said, and then we also want to fix it at runtime, so it helps to know how our binary file is laid out. That is very much the foundation of hot patching. So we'll look at some of those things.

Before I get into it: here I have a simple program called Hello World, which simply prints Hello World, and there is something called Hello World hijacked. This one has been hijacked. One possibility is that someone just changed the string from Hello World to the string that we are getting on the screen. Another is that we have a function which prints some string, and someone has changed the control flow of the program and hijacked it completely to do something entirely different. So we'll try to look at those things.

To start with, I'll talk a little about the binary file format, so we get some idea of what it comprises and what is in there that can help us understand it better. On Unix systems the standard file format is ELF, the Executable and Linkable Format. Then we'll look at how the binary file maps into memory, that is, how the information inside our binary file actually gets mapped into memory when we run it, and how to backtrack code flow from the binary file alone. When we don't have the source code, how can we make out what is happening, which function is getting called, in what order, and what is happening inside?
Then patching binary files. We have a binary file here in our example which says it has been hijacked, so how can we check its integrity? What do we look for to test whether it is intact and nothing has been tinkered with to make it malformed or anything like that? Then there is dynamic integrity forensics. So there are two ways. One is static: we have a binary file on disk, it has been manipulated, and when we run it certain things happen. The other is that you are running a program and at runtime someone injects something into it, maybe through a debugger. We have something called ptrace, using which you can attach to the process, take control of it, and then change things in it to do whatever you want. So how can we do that runtime analysis?

Talking about the binary file format: we have two types of binary file. One is the executables that we run, and then we have shared libraries; when you run ldd on any of your executables or shared libraries you see a lot of dependencies. So there are linkable files and executable files. You cannot directly run your shared library files; they are linkable files. What is listed here is pretty much there in all your binary files, but the program header is optional. A purely linkable file, such as a relocatable object file, does not have this part; it basically does not need it. The program header has a lot of information, such as what the text segment is and what the addresses will be, and much more, which maps the file into memory at runtime. Shared libraries are meant to support your executables and not run on their own. They do still carry program headers, because the loader has to map them, but you are not going to run them as an individual entity.
At the top we have the ELF header, then the program header, which describes how the file is laid out in memory; we will look at it. In between we have all the sections where all our code lives: all the functions, everything that you write in your high-level code gets converted into various sections. And then you have the section header table, which does not really contribute at runtime but helps you in debugging if you have it. It has the information about which sections there are, what addresses they are supposed to be at, what the offset is from the start of the file on disk, and how they are supposed to be laid out at runtime. So it helps you in debugging, but it does not contribute at runtime. So the sections, the program header table, and the ELF header always have to be there. These are the things you need in your file to be able to run it.

Since the ELF header is the first thing in our binary files, this is the structure defined in the ELF specification. Say you have a test binary and you run readelf -h, which lists only the ELF header. A few things here are very important. The data field tells you whether the binary is for a little-endian or a big-endian system. It tells you about the architecture as well: here the class is ELF64, so it is for a 64-bit architecture. Then it tells you it is an executable file; if I were running it on a .so file it would tell you it is a shared object. The entry point is very important; we will look at some examples of how it matters at runtime. Then the start of the program headers: you can see it is 64 bytes into the file, and the size of this header is 64 bytes. So the program header starts immediately at the end of the ELF header; that is fixed in your binary files.
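The fields that readelf -h prints can also be decoded by hand, which helps in seeing that the ELF header really is a fixed 64-byte structure at the start of the file. The following is a minimal sketch in Python, assuming a 64-bit ELF file; the function name `parse_elf64_header` and the dictionary keys are illustrative choices, not part of any tool shown in the talk.

```python
import struct

# e_type values from the ELF specification
ET_NAMES = {1: "REL", 2: "EXEC", 3: "DYN", 4: "CORE"}


def parse_elf64_header(data: bytes) -> dict:
    """Decode the fixed 64-byte ELF64 header found at offset 0 of the file."""
    if len(data) < 64 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF64 file (bad magic or too short)")
    if data[4] != 2:  # EI_CLASS: 1 = 32-bit, 2 = 64-bit
        raise ValueError("this sketch only handles ELFCLASS64")
    little = data[5] == 1  # EI_DATA: 1 = little-endian, 2 = big-endian
    fmt = ("<" if little else ">") + "HHIQQQIHHHHHH"
    (e_type, e_machine, e_version, e_entry, e_phoff, e_shoff, e_flags,
     e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum,
     e_shstrndx) = struct.unpack_from(fmt, data, 16)
    return {
        "class": "ELF64",
        "endianness": "little" if little else "big",
        "type": ET_NAMES.get(e_type, hex(e_type)),
        "machine": e_machine,
        "entry": e_entry,          # entry point address
        "phoff": e_phoff,          # start of program headers
        "shoff": e_shoff,          # start of section headers
        "ehsize": e_ehsize,        # size of this header
        "phentsize": e_phentsize,  # size of one program header
        "phnum": e_phnum,
        "shentsize": e_shentsize,  # size of one section header
        "shnum": e_shnum,          # number of section headers
    }
```

To try it on a real file, something like `parse_elf64_header(open("/bin/ls", "rb").read(64))` should report the same entry point and header sizes that readelf -h shows for that file.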
But for everything else, all the sections and section headers, there is no fixed position in your binary file. The ELF header is always at the top of your binary, and it is followed by the program header if it is an executable file. After that you have information like the size of this header, then the size of a program header, which is 56 bytes, the size of a section header, and the number of section headers. So the size of a section header is this many bytes and the number of section headers is this; that is, there are 31 entries in your section header table. So this is pretty much what we get: there are 31 sections, the program header, and so on.

So we will move on to the program header and what it contains. The first part does not really have any significance. This part is very important: it tells you what the loader of your program is. When you start running any executable, before the system even brings your actual binary into memory it looks in your binary for the interpreter. The dynamic loader, which comes with glibc, is the interpreter here; it gets mapped into memory and is responsible for bringing in everything required to run this program. Then here you have two LOAD segments, and you can see the permissions here: read and execute. So this is your text segment, where all the code that you have written resides. There are basically two segments that are very important: the text segment and the data segment. The data segment is where things like global variables and function pointers reside, so the data segment will have read and write permission. Both are LOAD segments, but based on the permissions you can make out which is your text segment and which is your data segment. The text segment will have execute permission and the data segment will have read and write permission.
You do not give the text segment write permission, so that people cannot come in at runtime and change your binary code. That is not supposed to happen; people can tinker with it and do certain things, but it is not allowed by default. Then there is the dynamic section. We always call printf but we do not define printf; we are using function calls defined in glibc. Those come into the dynamic section: all the dynamic references that you are making, all the dynamic symbols, get mapped here. Then there is the note section, which is just informational. Then we have .eh_frame, which has to do with exceptions: the exception handlers you write and so on come in .eh_frame. Then there is stack information, and then relocation information.

Now these two are very important, these three actually: the interpreter and the two LOAD segments, because these are what contribute to the actual execution at runtime. First this one gets loaded, and then it takes care of loading the text segment and data segment of the program. I do not have it here, but there is also a segment which has the list of all the dependencies. Based on the relocation table entries and the dynamic symbols coming from outside, like the printf entry and everything else that comes from outside, it identifies the external shared library dependencies and tries to bring everything into memory before it starts executing the first instruction in your text segment. So that is how it maps, and for the two LOAD segments you can see what it pulls into memory. You can see this is the first, second, third. The third index is mapped to all these sections in your binary file, and the second, the text segment, maps to all these sections. These are nothing but section names, which you can look up in your binary file.
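The rule of thumb above, telling the text segment from the data segment by permissions, can be written down as a small sketch. It assumes 64-bit program header entries of 56 bytes each and takes the offset, entry size, and count from the ELF header; the helper names `parse_phdrs64`, `flags_str`, and `classify_load` are made up for this illustration.

```python
import struct

PT_LOAD, PT_DYNAMIC, PT_INTERP = 1, 2, 3   # segment types from the ELF spec
PF_X, PF_W, PF_R = 1, 2, 4                 # segment permission bits


def parse_phdrs64(data, e_phoff, e_phentsize, e_phnum, little_endian=True):
    """Read every Elf64_Phdr entry from the program header table."""
    fmt = ("<" if little_endian else ">") + "IIQQQQQQ"
    headers = []
    for i in range(e_phnum):
        (p_type, p_flags, p_offset, p_vaddr, p_paddr,
         p_filesz, p_memsz, p_align) = struct.unpack_from(
            fmt, data, e_phoff + i * e_phentsize)
        headers.append({"type": p_type, "flags": p_flags, "offset": p_offset,
                        "vaddr": p_vaddr, "filesz": p_filesz,
                        "memsz": p_memsz, "align": p_align})
    return headers


def flags_str(flags):
    """Render permission bits the way readelf does, e.g. 'R E' or 'RW '."""
    return (("R" if flags & PF_R else " ") +
            ("W" if flags & PF_W else " ") +
            ("E" if flags & PF_X else " "))


def classify_load(ph):
    """Guess the role of a LOAD segment purely from its permissions."""
    if ph["type"] != PT_LOAD:
        return None
    if ph["flags"] & PF_X:
        return "text"
    if ph["flags"] & PF_W:
        return "data"
    return "read-only"
```

Binaries built by newer toolchains may have more than two LOAD segments, including a separate read-only one, which is why the sketch has a third outcome.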
Now the section headers: here we are just listing all the sections, as you can see. .interp must be there somewhere. Yes, the first one; this is your loader program. There is a section for everything, and the sections of the data segment and the text segment are all listed here. According to the mapping, it maps them into memory. Here, like I said, there is a mistake: there is nothing called a BSS segment. There is a text segment and a data segment, and there is something called the BSS section, and the BSS section is part of the data segment. If you look at this slide, the third one is our data segment, based on read and write, and here you see the .bss section and the .data section. So .bss becomes part of the data segment; there is nothing like a BSS segment, and this is a mistake in the diagram.

So what happens is that in the program header we have the interpreter, which gets loaded first, and then it pulls in the text segment. You have this virtual memory map of your process. At the top there is kernel space and then the stack, and at the bottom it maps the text segment of your program. All the sections that are listed under text get mapped in this part of memory. Then we have the data segment, and then the heap starts from this point. The stack grows downward from the top, and all the dependencies that you have in the form of shared libraries are mapped in between these addresses. That is pretty much how the address space of any program looks.

Now, we saw some examples where we extracted section header information and program header information using the readelf utility. When we are trying to reverse engineer a given binary file, these are the tools which come in very handy. nm helps us list all the symbols used in our program. Symbols are nothing but the global variables and the functions that are used in the program.
These could be local to your program or come from shared library files. So it gives you a complete list of all the symbols in your program. If it is C++ code you get mangled names, which do not really make sense, so you can demangle them using this utility. If it is C code you have nothing to do here. Then objdump is very helpful, and readelf is very helpful if you are trying to get the entire structure of your ELF file; we saw some examples. With objdump, it is a binary file that we are talking about, so it converts everything in each section into disassembly and gives you the output. So you can check what is happening inside each of those functions based on your binary file alone. You do not really need the source code, but you need to be able to read assembly code for that.

Maybe at this point I can quickly show you some of these things. Running nm -D on this test binary gives me a list of all the symbols in there. Now you see that all the symbols start with _Z, which does not look like a real name. According to the rules of function name mangling, every mangled name starts with _Z. So any time you see a name starting with _Z, you should assume it is probably a mangled name and try to get the real name using c++filt. Even if it is not mangled and you just pass it to c++filt, it will give you the same name back. So this is the actual name of the function that is used in the program. And here it says undefined, which means it is probably coming from some outside library and has not been defined inside the program. Here you see there is an init function; I guess I forgot to mention it on the sections slide.
This is something very important: _init. It runs even before main. Usually we think that main is the first function being called, but we have initializer and finalizer functions which come into play, and this one runs before main. So you can see all the functions and their assembly code in there, and if you can read assembly you can pretty much reconstruct the entire control flow of your program from this.

Then readelf: if you say -a, it gives you the complete output. So we have our ELF header, and then we would expect the program header here, but the section headers come next in this case. We could see that the start of the program headers was at 64 bytes, so ideally the program header should have been shown here; probably the order of this output is not very informative. Here we have the program header with everything listed. At the end of the program header it also gives you a listing of which segment includes which sections, and then the different sections follow. This is the dynamic symbol table. Here we have a lot of symbols being referenced, and you can see whether each is a function or has no type. So it tells you what it is, and whether it is a global symbol or a weak symbol; global probably means it is coming from somewhere outside. You can even see that it is coming from the glibc library. So these are tools you can use.

Moving on. Knowing how the ELF binary format is structured and what tools we can use to extract information from it, what can we do with it at runtime, or what can we change while the file is on disk? We don't have the source code. How can we change the control flow or make it do things differently? To do that you can patch your binary file. To patch it, you have to write your code and then you have to know how to add it to the program that you have. Knowing that is a bit tricky.
We looked at our program's text segment and data segment. When you run a program, by design there is a way it gets loaded into memory: there is a gap of one page between the text segment and the data segment. So you can inject your code between the text segment and the data segment. But you might overwrite the data segment, so your code has to fit within one page, whatever the page size is on your system. If it fits, you still have some work to do: extend the size of the text segment in the program header and things like that, to keep the file intact and not malformed, so that it does not just throw an error when you run it. So it is not as simple as just putting code in that area. You have to do a lot of manipulation: adjusting the header entries, increasing the size of the text segment and of whatever section the code is going into, and probably shifting the offsets of some of the sections that come after the point where you are inserting, and so on.

So there are various ways you can inject your code. One is between the text segment and the data segment. Another is that you put it before the text segment and move the starting point of your text segment back. But once we have put our code in memory, how can we change the control flow? By default, everything in the text segment is fixed: which function is called when, when it returns a value, and so on. So how do we make the function that we injected get called? There are various ways you can do it. You can do certain manipulations in the text segment and certain manipulations in the data segment. You can follow any of these routes to change your control flow. The first one here is the procedure linkage table. We probably did not talk about how relocations happen.
So when we have functions called from shared libraries: in the symbol table such a function is a relocatable global symbol with an address associated with it. There is a relocation table, and at runtime, when execution goes to that address, there is an entry in the relocation table, and the relocation table points further to the procedure linkage table. For every function you call that comes from outside, there is a procedure linkage table entry. In the procedure linkage table entry, the very first instruction is a jump through your global offset table. If you have come across that function once already, the global offset table will have the address of the actual function.

I did not show this, but if you have a process running and you open its /proc/<pid>/maps file, you will see that the executable is mapped in a certain address range. Like I said, the loader checks what dependencies there are and maps all those shared libraries in certain ranges. So it gets the actual address of the function from the range of that library and updates it in the global offset table at runtime. But the very first time, say my program makes a reference to printf, the global offset table will have an entry which points back to the procedure linkage table, and the loader looks in the relocation table to see what the type of this relocation is. There are many different relocation types, and based on the type there are different computations to calculate the address at runtime; it updates the global offset table the first time. For further references it just finds the address in the global offset table, jumps to that location, and executes it. That is how it happens.
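The first-call versus later-call behaviour described above can be mimicked with a toy model. This is only a simulation in Python, not how the dynamic loader is implemented: the class `ToyLazyBinder`, the `RESOLVER` marker, and the example address are all invented for illustration.

```python
RESOLVER = "plt-stub"  # marker: this GOT slot still points back into the PLT


class ToyLazyBinder:
    """Toy model of lazy binding: one GOT slot per imported function."""

    def __init__(self, library_symbols):
        # Stands in for the shared libraries mapped into the process:
        # symbol name -> address inside the library's text segment.
        self.library_symbols = dict(library_symbols)
        # Before the first call, every GOT slot points back at the PLT stub.
        self.got = {name: RESOLVER for name in self.library_symbols}
        self.resolutions = 0

    def _resolve(self, name):
        # Stands in for the loader computing the real address at runtime.
        self.resolutions += 1
        return self.library_symbols[name]

    def call_via_plt(self, name):
        """Return the address control would finally jump to."""
        target = self.got[name]          # PLT entry: jump through the GOT
        if target == RESOLVER:           # first call: slot not yet filled in
            target = self._resolve(name)
            self.got[name] = target      # loader patches the GOT slot
        return target
```

Calling `call_via_plt("printf")` twice on the same object resolves the symbol only once; the second call reads the address straight from the toy GOT, which is the behaviour the integrity checks later in the talk rely on.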
So for the procedure linkage table: for any function coming from outside, instead of the first instruction of its procedure linkage table entry being a jump through the global offset table, you manipulate it to jump to the code that you injected. The first instruction in the procedure linkage table entry will be a jump to your injected code; that is how you change the control flow.

The second way of doing it: say you have a function written in your program, anything, call it XYZ, and you know that at some point XYZ is going to be called, and at that point you do not want it to execute what is written inside XYZ. So you change the first few instructions to jump to the function that you have injected. That is called a function trampoline: you change the first few instructions of the function to redirect it to a function or position that you have injected.

Then .ctors and .dtors. We saw _init and _fini, the initializer and finalizer calls in your program, which come by default with the system. You can change the pointers there: before execution gets to main, you have a jump instruction in the init function which redirects to the code that you have injected. So you are changing the control flow by doing this kind of manipulation.

Then you have the data segment: how can you change things in the data segment to change the control flow? We talked about the procedure linkage table having a jump through the global offset table. Since the global offset table gets updated with the actual addresses of the functions at runtime, it has write permission, and that is why it is put in the data segment of the program. So in the global offset table, before the actual address of the function is even computed, I put the address of the function that I have injected into my program.
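From the analysis side, the function trampoline described above leaves a recognizable trace: the function begins with a jump whose destination lies outside the text segment. Below is a rough heuristic sketch, assuming x86-64 and handling only the two direct jump encodings (0xE9 and 0xEB); the function names and the address range arguments are hypothetical, and a real function can legitimately start with a jump, so a hit is a reason to look closer rather than proof of tampering.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def jump_target(code: bytes, addr: int):
    """If code starts with a direct x86-64 jmp, return its destination.

    code is the raw bytes at the start of a function and addr is the
    virtual address of the first byte. Returns None otherwise.
    """
    if len(code) >= 5 and code[0] == 0xE9:          # jmp rel32
        rel = struct.unpack_from("<i", code, 1)[0]
        return (addr + 5 + rel) & MASK64
    if len(code) >= 2 and code[0] == 0xEB:          # jmp rel8
        rel = struct.unpack_from("<b", code, 1)[0]
        return (addr + 2 + rel) & MASK64
    return None


def looks_hooked(code: bytes, addr: int, text_start: int, text_end: int):
    """Flag a function whose first instruction jumps out of the text range."""
    target = jump_target(code, addr)
    return target is not None and not (text_start <= target < text_end)
```

The text range would come from the program headers or from the process's maps file; the relative offset is counted from the end of the jump instruction, which is why 5 or 2 is added.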
Then we have global data and variables: it could be a function pointer, or some structure which has a pointer. These two are pretty much the same, so you just change the pointers to point to your own code. So that is pretty much what you do. Yes? With address randomization things change a little, but by default, if you do not have randomization enabled, this is how you would do it. We are trying to improve security, but if addresses are not randomized then you can pretty much exploit it this way.

Then we have shared library injection. We talked about injecting our own code in place, somewhere between the data segment and the text segment. What if we don't want to inject it like that; is there an easier way to inject? So I write a program, compile it into a shared library, and then somehow load it into the address space of the program which is running or which I am going to run. One way is the environment variable LD_PRELOAD, using which you can preload any library: you specify a library, and it is loaded with the executable and mapped into its address space. So you don't really have anything injected between the text segment and the data segment, but you have a library loaded in the address space which has the code that you want to execute. Then you can get the address range where this file got loaded, and change the control flow using any of the methods that we talked about here.

You can also get control of a running program. If you have a shared library in which you have written the things that you want executed, you can use ptrace: you attach to your running program with ptrace and then use open and mmap to actually map this shared library into the address space of your program, and that is how you get it into the address space of that process.
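Finding the address range where a file got loaded comes down to reading the process's maps file. Here is a small sketch, assuming the usual Linux /proc/<pid>/maps line layout (address range, permissions, offset, device, inode, optional path); the names `Mapping`, `parse_maps`, `segment_kind`, and `find_mapping` are made up for this example.

```python
from typing import List, NamedTuple, Optional


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str   # e.g. "r-xp"
    path: str    # empty string for anonymous mappings


def parse_maps(text: str) -> List[Mapping]:
    """Parse the contents of a /proc/<pid>/maps file."""
    mappings = []
    for line in text.splitlines():
        parts = line.split(None, 5)
        if len(parts) < 5:
            continue
        start_s, end_s = parts[0].split("-")
        path = parts[5].strip() if len(parts) > 5 else ""
        mappings.append(Mapping(int(start_s, 16), int(end_s, 16),
                                parts[1], path))
    return mappings


def segment_kind(m: Mapping) -> str:
    """Name a mapping by its permissions, as done by eye in the talk."""
    if "x" in m.perms:
        return "text"
    if "w" in m.perms:
        return "data"
    if "r" in m.perms:
        return "read-only"
    return "gap"


def find_mapping(mappings: List[Mapping], addr: int) -> Optional[Mapping]:
    """Return the mapping that contains addr, or None if it is unmapped."""
    for m in mappings:
        if m.start <= addr < m.end:
            return m
    return None
```

For a live process this would be used as `parse_maps(open(f"/proc/{pid}/maps").read())`. Given an address, `find_mapping` tells you which file it falls in and `segment_kind` whether that part is executable.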
Or you can use dlopen. Like I said, you attach with ptrace, get control of your program, and then call dlopen. But dlopen comes with the libdl library, so it may not already be mapped. libc is mapped by default with pretty much all programs, so you can use this libc function to do the same thing that dlopen does, that is, map a shared library into the address range of the process.

So we saw the methods. In one, we inject code between the text segment and the data segment; that was pretty much static: we change things on disk before we even run, manipulating the headers and offsets so that the file stays intact and still runs. In the other, we inject a shared object at runtime.

So, static integrity analysis is what we do when we know that things have been changed on disk before the program is run: you have a binary file that has already been manipulated, or patched, we can say. There are a few things we should check to validate whether it has been tinkered with. First, the changed entry point. We saw that every program has an entry point. Usually your text section has all the code that you have written and want executed, so the entry point will be within the range of the text segment in the address space of your program. You can see here that there are certain libraries being loaded. This is pretty huge; we can probably do with something simpler. So here you see we have the libc library loaded and our binary loaded, and there are three parts to it: you see the same binary mapped three times. One is your text segment; I say this part is the text segment because it has execute permission. Here I have write permission, so this becomes my data segment. Based on permissions I can pretty much identify what each one is. So this is my data segment, and there is something called the read-only segment of
your program, which goes here. Similarly for all the library dependencies that it might have. This one is pretty much unused; it does not do anything except maintain a gap between these two segments, as far as I know. So this is your text segment, this is your data segment, and this is your read-only segment. Now, this is the binary I am running, this is my text segment, and the address range is this. If the entry point is not within this range, it raises a suspicion for me; it alarms me that something has gone wrong and someone has probably played around with it. So that is what we look for when we check the entry point.

Then we have initialization code integrity. I said we have the init function and the finalizer function. Again it is the same thing: you see whether it makes any reference to anything outside your text segment; that raises an alarm again.

Then we have PLT/GOT hooks. This comes into the picture for all function references coming from shared libraries, where relocation needs to happen. For all these functions coming from outside, we check whether the first instruction of the procedure linkage table entry refers to an entry in the global offset table or not. If not, there is something suspicious. Then we check the address, if it has been updated in the global offset table: whether it is within the range of the library that is mapped, and within the text segment of that library. Say, for example, I have the printf function. I will check the entry for printf: whether the first instruction of its procedure linkage table entry points to the global offset table or not, that is, whether the address is in the GOT section. If it is, what address has been written there? Once I get the address, here I have glibc mapped, so I know my glibc is mapped and its text segment
is within this address range. If the address has already been updated and it is not within this range, someone has played around with it. So that is how we track it. It is also very intuitive from the function name: usually, if it is the puts function, it would say something like puts@GLIBC, so you get an idea that it is probably coming from glibc, if you have worked with it. Even if not, you just follow the addresses and you will get to know where it is coming from. Or if it has been updated, I will just look at the maps and see which address range it falls in. That is how you track and validate it.

We have DLL injection. This one is dynamic linker injection; I am not sure why I put it here. Anyway, let's talk about this one. Reverse text padding is when you put your injected code before the text segment; then you have to move the start of your text segment to an address lower than it is supposed to be. By default your text segment starts at this address, so if it starts anywhere before that, you know that someone has probably put some code before it and updated the starting point of your text segment.

Then text segment padding, when you put your code after the text segment. We are talking about segments, and a segment has many sections. At the end of the text segment we have this section, where you are supposed to have your catch functions, your exception handler code. If your text segment ends somewhere other than at this exception handler section, that is again suspicious for me.

Then we have dynamic integrity analysis, for when some shared library has been injected at runtime. /proc/<pid>/maps is what I check: I see which libraries are mapped in the actual address range of the program. We already saw here that we have these maps;
usually we have a lot more dependencies. Usually you will see that they come from the standard paths; if you see some suspicious library mapped here, that raises an alarm for you. Then you have LD_PRELOAD on the stack. One thing is you could check the maps file to see whether anything was already mapped. If not, you can attach gdb to your program, and there is something called the auxiliary vector. How the loader does it when you start running your program is like this: your interpreter program, the loader I showed you in the program header, gets loaded; then it brings in your text segment and data segment, and all the dependencies get mapped; and then the stack and everything is initialized. On this stack, before the first call frame is created, it puts all the environment variables that were associated with the program, next to what is called the auxiliary vector.

How I can check it at runtime is like this: here in the maps file again, if you have the process running, you have the stack address range. My stack grows downward, so this is my higher address; this is where it starts from. So the auxiliary vector will be just before this. Say I want to read 4000 bytes before this address: inside gdb I go backward from this address, read what is there, and print it in string format, and I get the auxiliary vector. Okay, the process is done now. So here I just take the higher address of my stack range; I am going to read 4000 bytes, so I convert the address into hexadecimal, subtract 4000, and go that far back. You see what environment variables are associated and what is in the auxiliary vector. So you can pretty much use these facilities to see what environment variables were in use and where it is trying to pick the binaries from; there's so much that
you can have a look at by doing this. So if any LD_PRELOAD was being used, you will see it on your stack; that is what I am trying to get to. You will notice that there is an LD_PRELOAD here, and if the library was loaded, obviously you will also see a mapping of that shared library in /proc/<pid>/maps.

Then we have GOT address verification. Like I said, at runtime, if there is a reference to printf and it has been called for the first time, the GOT gets updated. So I will go through the list of all the entries in the global offset table and see whether they actually refer to the correct address ranges, meaning the libraries that are mapped, whether they are in the text segments of those libraries, and whether the library is suspicious or not. That is how I would validate it. That was pretty much what I had. Any questions?

Sorry? Okay. So we talked about a lot of ways to patch files. The shared library ones are pretty simple: you just create a shared library and all you have to do is map it into the address range. LD_PRELOAD was one; or you can write a program which attaches to your process with ptrace and does it. So these are the manual ways. Injecting code between the text segment and the data segment is the trickier way; that is when you are manipulating the binary file itself while it is on disk. For that it is a lot of work and you might need tools. There are a few, and you can search for them on the internet, but we don't recommend or talk about them because mostly they are used for not-so-good purposes.

Yes, in that case we cannot attach ptrace, because you can have only one tracer attached to a process. So to make their program more secure, some people write their code in such a way that ptrace is already attached to the process, so that people cannot inject any shared library using the
ptrace attach mechanism, that is, getting control of the process and loading a library into its address range. Yes, you can do that, but LD_PRELOAD would be easier for somebody to track down: just look at the auxiliary vector and look at the maps. But yes, you can use it. Anything else? Thank you.

For non-C++ code you don't have this thing. Sorry? Yes. And if you use it with any name that is visible to you in the symbol list, you will either get the same output, or, if it is a mangled name (sorry, I had missed the term mangled), it will give you the actual name. If it is the real name, it will give you the same output back. That is how you know whether it is mangled or not. But usually, if it is mangled, it starts with _Z; that is what I have noticed.