All right, now that I told everyone all the secret stuff before the stream started. Oh, shoot. I think I just have it here, let me check. Wait, that didn't work? That's what I wanted to stop. Let's see, where was it? Oh, there we go. All right, you can all thank Connor for this great image of building a web server, although the end was a fixed version, so it's okay. It's close enough. Y'all did pretty good, I will say. Many of you started late. I guess it's not a judgment call, but only you will know internally if that was a good or a bad decision. 44% in all of it, and like 50 people got A's, or 50% of the people got A's, so that's pretty good. Good job, I'm proud of you. Oh, my screen share broke, what's going on? Is it working? Twitch, hello, Twitch. Is it working now? Does somebody say yes? Okay, I'm watching it. This delay is so stupid. Okay, all right, and now we are going to put to use all those skills you've been developing. Do you feel like you've been developing skills in this class? It's definitely still frozen. It's still frozen? I thought I saw it move. Oh, it's loading, but it's not updating. I thought I'd redo this share, but this thing is so stupid. Software. Can one of you all just go fix OBS so it works and doesn't have all these insane problems? That'd be great, how about that? Boom, updating, I see it now, great. Okay, people online, this was the grade distribution that I guess you didn't get to see. Of course, I would never show any of your individual grades because that would be a violation of your privacy. Yeah, it's pretty good. And this is C, so these people still pass too, I guess. Okay, back to my joke about skills. Do you feel like you've been getting skills in the class? What types of skills? x86 assembly. You're going to deeply love x86 assembly. You're enjoying understanding exactly how the CPU executes programs.
You feel like you have a mastery over it. If I just keep talking like this, will you believe me that you do feel this way, or maybe somewhere deep down you do feel this way? Cool, okay, so I regret to inform you that many of you, after you graduate, will probably not be writing assembly code. Why is that? Yeah, because it's not as easy as Python or Java. Why is it not as easy? Yeah, it's low level, right? You have to deal with moving the data literally from memory into a register and then computing on it. It's simple. I hope you've maybe come to appreciate that, at least. And it is not that complicated. If you looked at the programs that you wrote, you wrote a whole web server, probably using fewer than 10 different assembly instructions. Some forms of moving, comparing things, doing addition, probably two or three-ish of those. So honestly, if you think about it, what you used out of the universe of all possible assembly instructions was pretty small, yet you were able to do something super cool, right? You created a web server that the tools you were using in module one could literally talk to. So, going from there, why do we do this? To understand what's going on under the covers. And that's really what we're moving to now: how do we apply that hard-won knowledge that you gained from analyzing and writing assembly code to understand programs? So I was actually just talking about OBS, the streaming software, and I was saying that it's super annoying. I said one of you could go fix whatever bug I'm having here. Why did I say that? It's open source. You could go to the source code. You could find it, download it. What if you wanted to fix a bug or identify a bug in Windows? [inaudible] What was that? [inaudible] Yeah, so is the source code available?
What are some other things, software that you use where the source code is not available? Canvas? Ooh, I hate Canvas. Anyways, other things? Windows, what other things? Nobody plays games in this class? People are, like, sharing their gaming setups. Do you frequently have access to the source code of the games that you play? What games do you play that you have source code to? Minecraft? Yeah, there's a good reason for that. Does [inaudible] a Genesis game count? Genesis games, yeah, consoles. So you go back, you get the raw data, and you just reverse it. Yeah, so we'll get there. Reverse engineering as a concept is quite large. Here we're interested in understanding software, and specifically software where we do not get the source code, but we want to identify and understand the behavior of that software system, ultimately, as we'll see, to find bugs. Because if you don't know what something does, you can't find a bug in that system. So, of course, anytime somebody tells you about reverse engineering, you should probably ask yourself, well, what is forward engineering, if I'm going backwards? What's the forward thing? So, what's the forward software engineering process? You design the software first and then what? You have to write the code, so you write code that implements the design, and then you actually build it. If you're using a language like C or C++, which we'll focus on pretty much here, that will then generate a binary that is actually runnable and executable by your CPU. You have written these binaries and this assembly, so you are no stranger to this process. There are a lot of steps in this process. It could look different based on how your process goes, but essentially, you have some high level design.
At this stage in your careers, you're maybe not doing a ton of design, right? Who's doing the design? Yeah, whoever's running the course, right? The assignment description is your design and you have to implement something that matches that. As you go further and further, you will be expected to design systems or work with people to collaboratively do designs. We have to design something, like what should the system do? And then you've got to code it up, and then you compile it, and then there will obviously be tons of bugs. I think the first time for me was probably like three or four years ago. Is that pre-pandemic or post-pandemic? Probably post-pandemic, maybe 2018? 2019? I wrote one Python script, it was like 100 lines, no, probably like 40 or 50 lines of code, and I ran it and it did exactly what I wanted it to do and didn't, I think, have any bugs, and it shocked me. I looked at that code so closely because if it worked, how could there possibly not be any bugs? It was insane. So you will have bugs, and if you don't, you should be highly skeptical of what's going on and not trust the results until you prove to yourself that it actually does work. So you have a lot of bugs, you'll compile it, you'll have more bugs, you will compile again, you'll change header files and compile. Finally, that code will be compiled, and kind of hidden in that compilation step is the assembly process, and out at the end we get an executable file that a company can ship to your computer to run software on your machine. So, if that's forward engineering, then what's reverse engineering? Conceptually, what is it? Yeah, we wanna go backwards up this process. And there can be different types of reverse engineering. You may be given code and you want to extract and learn the design of the code, or the high level intention of what the developer was thinking.
But for our purposes, you'll be given a binary and you need to understand what this binary does and how it works. So you're doing reverse engineering up from the binary, to try to understand what's the likely C code that was written that created this binary, up to what's the purpose of this? What is it trying to do? And what makes this so difficult, as we'll see, is that at every single step of this forward engineering process, information is lost. What do I mean by that? Think about any sort of comments in the code. Do comments help you understand the code? If they're correct and not outdated, right? I saw a lot of your code last night, and when you wrote your web server, you hopefully followed my example and had comments in your assembly code about what the syscall was and what you were trying to do, and that helped you understand and think about it. But once you compile your code, all those comments are gone. All that information, that knowledge that exists in the source code, is no longer in the binary that gets generated. What about design to code? Does that lose any information? Yeah, think about it. Where are you guys at? Are you taking 340 yet, or anything like that? 310? Some of you? So if you had programming classes in your first year, like a 100-level or 200-level class, go look at your C file and try to understand, what was I trying to do in this program? Try to guess what the assignment description was based on your code. It's actually difficult, because a lot of your assumptions are baked into the code. The code doesn't necessarily say, oh, we wanna do A, we wanna do B, and this is why. So literally at every single step, information is lost, and we'll see examples of this. Oh, we talked about this, right? What's lost in the translation between assignment and code?
The intention. You will experience this, and hopefully you'll think back to this class, when you're looking at code and you think, who is the idiot that wrote this terrible code? What were they possibly thinking? And then you realize, oh no, it was me, like a year or two ago. This literally happens all the time, it's terrible. So that intentionality of what you as the developer intended is lost and not always present in the code. Guys, hey. And as we talked about, things like comments are gone. Other things: how many of you had very handy, nice variable names in your assembly code? Did you like that nice file descriptor register you could use? Or that read-bytes register you could use? You didn't use that feature? No, because it doesn't exist, right? Assembly code does not have variable names, all you have is registers. So the compiler, when it spits out a binary, does not keep, oh, this variable is called i, this variable is called fd, this variable is called read_size. Function names too, as we'll see. The CPU doesn't care, right? You had labels in your code, but what were those labels actually doing? What did the assembler do with your labels? It turns them into sometimes addresses, but also offsets, right? Byte offsets, literally how far to jump forward or backwards. Structures: if you have any type of complex structure, like the sockaddr struct you used with socket and bind, those are just bytes in memory. Where the exact field boundaries are gets completely obliterated. And sometimes things will get completely removed based on optimizations. Compilers are incredibly smart. They can do things like, oh, actually I know that this branch is always gonna be false or true, and so I'm just not even gonna include that code in the compiled binary. And so the reverse engineering process is actually getting this stuff back, which is crazy. So let's compile a little program and check it out. I will use the pwn.college desktop.
I like the building a web server one. We'll just go here. Doesn't really matter. What was that? Can y'all see this in the back, or should I increase the font size? Quick, what does this say? There we go, boom, boom. Okay, cool. So, oh, fine, let's use Vim. Okay, I need a main function. I'm not gonna have it take any arguments. Return zero. Okay, let's make it slightly fancier. I'm gonna have a character buffer called buf that has 1024 characters. I am then going to read from standard input. So with scanf, percent 1023 s. That will read up to 1023 characters into a string. I need to pass buf, or is it the address of buf? We'll see. I do not remember off the top of my head which header I need for scanf, so how can I find out? What do I always say? Read the man page, yes. Boom, stdio.h. Oh, I don't want sscanf. That's reading from a string. I just want scanf. Cool, so let's copy this and Ctrl+Shift+V, paste. All right, good. I think that's right. Besides just reading, let's do a printf, hello world percent s, newline. That'll be good. So should this program compile? Maybe, no one knows. All right, what's going on here? Okay, stdio.h should be good. Right, quit. GCC, example: format not a string literal and no format arguments. Ah, because I used sscanf, thank you. Good call. Which we just talked about. Baby's first program, good, cool. All right, so what is actually going on here? GCC is taking my C code, and it actually has several different phases, and you can get the output of each phase. From the C code it will output the assembly, which it then assembles using as, the tool that you literally used in the previous module, and then it uses ld to link that into something loadable. Finally, we end up with an ELF file, and we'll look at exactly what that is.
A file format is all that ELF means. I guess it's a cute name. I don't know exactly why that is the name, but I'm sure somebody could look that up. It's 64 bit, important. So note, it's x86-64, so that's the specific assembly that's in there. We'll look at maybe other things later, but that's it for now. So objdump is a great tool. I'll say this a few times, it's one of those back-in-my-day things: back in my day we learned cybersecurity and we just used objdump. So we were reverse engineering binaries using very primitive tools. But why is this primitive? So let's do objdump with -M intel. All right, so what objdump does is it takes an object file, an ELF file, and when you use -D it tries to disassemble every byte in the program, so we can see different sections. Does this look familiar, having like a dot something? Where'd you see this before? Close. In the assembler syntax, when you were writing assembly, which dot sections did you have? Dot text, which was the code, and what else? Data, yeah. There's all kinds of stuff, but those were the only ones we needed there. We can kind of just scroll through. Here's one of the keys to reverse engineering that I will tell you now. Can you go through and understand exactly what every single byte means in this binary? Yeah, if you had enough time you could understand every possible thing about this binary, every possible thing about an ELF header. The class may be over by that point. So you have to get pretty good at seeing a broad picture by scrolling through things and seeing what they look like, and then sometimes diving in to fill in those details where it's super important. One story you should keep in mind is that in one of the DEF CON CTFs that I was playing in, I spent like, I don't know, three or four hours reverse engineering this function. I went super deep and I was figuring out exactly what this function does.
Okay, yeah, this one's taking input, great. Oh, this one's doing some crazy math operations. I looked up those math operations and I'm like, oh, this is crypto stuff. Super cool, it must be a crypto challenge. So I'm figuring out the crypto operations. I have it all written out, and then by the time I get done I realize, oh, this is the back door that the organizers put in so that they could drop flags on the system, and there should be no bugs in there. If there were bugs in there, everyone would have that bug. So I literally wasted all that time on a part of the program that was not relevant to my goal of finding bugs in this program. I went too deep. I tried to figure out everything instead of popping back up and going, okay, if I think of it as a picture, I've only explored one little piece of it. I need to explore more of it to get a broad idea of what's going on. So these are tricks when you're doing reverse engineering. So this is just me scrolling through here. Stuff, stuff, stuff. We'll get to the main part in a second. I promise, I actually don't know how long this takes, but luckily it's kind of like the Matrix. You can see bytes scrolling by. PLT, we'll definitely talk about that. Very cool, okay. And the place that we're actually interested in is the disassembly of section dot start, or sorry, dot text, which is exactly what you wrote in your assembly code, right? You were telling the assembler, I want the text section, which is normally where code goes. And this is actually the very first thing that gets called, underscore start. Did we define a function called underscore start in our C code? No. So this is stuff that the compiler and linker put in so that the proper libraries that we're using get set up. We'll talk about those in a second. And if we keep going we'll see other things, dtors.
Aha, we finally get to something called main. Does that look correct? Did we write that code? Yes. What did our main do? It allocated a buffer, and then what did it do with that buffer? It read input into the buffer using scanf and then printed that out as a string. Cool. And all these functions follow the x86-64 calling convention. What specific calling convention? Somebody just said it. Okay, thank you. System V, yes. Great, the System V calling convention, right? It specifies exactly which arguments go into which registers. So we can see that main first has this endbr64, which we can talk about later, but that kind of marks the start of a function in some sense. Then we can actually walk through each of these instructions, and guess what? You've been spending, I think it's like three and a half weeks, four weeks, on learning assembly. So reading this assembly code should be second nature, right? Pushing RBP onto the stack, moving RSP into RBP, subtracting 0x410 from RSP, moving this thing that's in this FS segment, which we didn't talk about, into RAX. Again, like I said, if I wanted to, I could look up and figure out exactly what this is. I know, because I've seen this before. But for you, you should be like, oh, that's weird, and just kind of note it in your head as weird, because right now I'm trying to understand this function. Then move RAX into RBP minus eight, XOR EAX with itself. What happens when you XOR a register with itself? It zeroes it out, and even though we're dealing with the 32-bit EAX register, the upper 32 bits of RAX also get zeroed out, right? Because of the special nature of writes to the 32-bit registers. Anyways, we then load effective address. So we calculate RBP minus 0x410 and put that into RAX, whatever. We could keep going through here.
Eventually we can see we call something, and we can see that objdump is able to tell that this is some kind of scanf function, right? It's not exactly one to one with what we called, scanf. We'll look later at what this actually is. Then other things happen and then we call printf. So at least from just what gets called, this maps to our C code, right? Our C code first called scanf and then called printf and then returned. So we have a call to scanf, a call to printf, stuff, and then a return. Ooh, good question online. We'll talk about that in a second. They're asking about function names. So there are several different reverse engineering tools. One of the best ones, which I already mentioned, is file. It literally will read the headers and important information of a file to tell you what it knows about it. It's a very simple tool, but it's great. You can run file on /challenge/run. It will tell you that this is a Python script. It'll tell you that this is an ELF 64-bit file. Let's see server. Yeah, so we can actually see the difference here. This is an ELF 64-bit LSB (least significant byte first, so little-endian) executable, x86-64, statically linked, not stripped. So we can use file. The other thing that's super useful is the strings command. If we look at man strings, it is again another very simple tool that literally goes through and looks for printable sequences that are, I think by default, yeah, at least four characters long. So it looks for four printable ASCII characters in a row, and if it finds them, it prints that run until it ends at a zero or some other non-printable byte. So there are actually a lot of strings in here. So what strings were from our code? What other useful information might there be in here? What do you see, yeah, in the back? Say it again? Printf, yeah, we can see, oh, this program probably uses printf.
And we can even see here it uses this __isoc99_scanf. There are probably different versions of scanf, so it's using the specific one that we asked GCC to use. What else? The size? Which thing on the screen shows that? Yeah, so we have the constant string that we passed to scanf. So if we're looking at this, we say, oh, I saw the call to scanf, and oh, I saw a string that looks like something you'd pass to scanf, one that reads 1023 characters. That's probably something I'd want to check: what size can I pass there? That's probably gonna be an argument here. We also have the string hello world percent s. We can also see that GCC added junk about what version of Ubuntu we used to compile this, and specifically what version of GCC, which could be useful for us. Yeah, all kinds of stuff, right? All the function names. I believe we should even have main in here. Or __libc_start_main. Yeah, so our function name main is in here, right? __libc_start_main is a different one, but our main function is in here. So all this information is contained in our program, but oftentimes, oh, I gotta go here. Oftentimes, this is actually including too much information. And this is not even compiled with debug symbols. Anybody know how to compile with debug symbols? -g, and let's just look at the difference. So now we can actually see there's a path in here. This looks like the exact GCC options that were used to compile this program. There's the directory where I compiled this, /home/hacker. That's kind of crazy. Why is my directory in there? Anyways, oftentimes we don't get that, right? We don't have debug symbols in our binaries. And even worse, there's actually a very trivial command that any developer can run on any ELF file, called strip, which discards symbols and other data from object files that aren't needed to run them.
So very often when you have a closed source binary or closed source application, the symbols will be stripped like this so that you can't easily get that data. So let's run that on ours, and now we'll look at our strings. So I guess we still see these two strings. Why do we still see those two strings? The strip command should not remove any of these strings because those are ours, right? It would break the functionality of the application. But let's look at it in objdump. So I'm gonna page through, and actually let's get to .text. Okay, there's .text. Oh wow. So how come I don't see my main anymore? My main guy's gone. Close. The strip program removed the symbol. So there's something in the ELF file that tells objdump, hey, main is at this offset. And that's where the names come from. But like you all know, the CPU doesn't need any of that. The CPU doesn't care. It only cares about memory and registers, right? So let's see, based on scanf, oh no, that's not right. Okay, there we go. So we know there was a call to scanf and we know there was a call to printf, and we knew the first instruction was this endbr64. So this is me making an educated guess that this is our main function. We can see that objdump had literally no idea that this was the case. And we can verify this. There is no main string in our binary. So now somebody who's trying to understand this has no idea what function names we used, or the variable names, or anything like that. So yeah, we covered a lot of this, but this is the high level of what we were talking about. First, we were disassembling. We used objdump to disassemble and go one step back from the binary to show us the assembly.
This is usually a very good and truthful process because, as we saw, you guys wrote assembly that got assembled directly to binary, and going back and forth is actually not that difficult because you're using the registers and everything. There are tools that we will look at to decompile code into C or a C-like representation. So tools that can take assembly code and produce okay-ish looking C. As we go further up, you can think of it as building on a shaky foundation, specifically for decompilers. They are often wrong. So this is why we actually had you study assembly code for four and a half weeks, five weeks. I don't know, time has no meaning anymore. We spent so much time looking at assembly because oftentimes the decompiler will be wrong and you will have to look at the assembly code to actually understand what's going on. And then you do a lot of thinking. Based on that, you're like, how was this program likely written? What is this for? Because you have no function names. So one of the things we'll see, part of the reverse engineering process to help you think about the code and the system, is to actually give names to things. Things like variables. What is this variable called? How is it used in this function? If I was writing the C code, what would I call that? What is this actually doing? There's like a lot of pointer operations. Is it sorting things? Is it iterating over a list trying to find a specific element? A lot of this is not very complicated, it's just stuff that you have to think about. And then you kind of get this click and you understand exactly how the program works, such that you can do whatever it is you want to do. You can find bugs in the code. You can, as you'll be doing in the module, break DRM-style checks, like license key checks and stuff like that. Questions?
Oh, and somebody mentioned it, so I guess I'll briefly mention it here. Using strip is kind of the baseline of what developers can do to prevent you from understanding their code from the binary. But there's all kinds of crazy stuff that they can do in terms of obfuscation. Take those strings that we saw. There are programs that will automatically go through and encrypt them, or usually not quote-unquote encrypt, but modify the values in memory so you can't see them. Like they'll XOR them with a constant value, and then when they use them, they will decrypt them by XORing with that key again. All kinds of crazy stuff can be done here. Cool, okay. But now, to understand how to reverse engineer, we need to actually learn more about the forward engineering process. This is stuff that we've covered a little bit, like calling conventions, and we've touched on the stack, but we need to actually understand what makes up a program so we can recognize it when we're reverse engineering these things. So a program can have many different modules, each composed of functions that contain basic blocks, or just blocks as we'll call them here, that consist of assembly instructions. So you've been writing assembly instructions. You've actually implicitly been doing some of this other stuff too, and fundamentally a program operates on variables and data structures. So if you remember when we did assembly and you were writing a web server, how come you didn't use the printf syscall to write out the header and then the data or something like that? There is no printf syscall. I maybe already made that reference, but thank you for remembering. So when we call printf, it's not the operating system that is doing it. It's literally another library. So let's look at that real quick for our example, because we had that.
So, let's see, I can use readelf. readelf is actually a super complex program that has a ton of information about ELF files and will show you every possible piece of it. readelf -a shows you all the magic values, everything in here. What I want to show is the libraries. So this is the loader, ld.so. This is what does the dynamic loading. And then it should say, yeah. This specific section here defines what libraries the program needs. So our program is saying, hey, I need libc.so.6. And the loader, at runtime, will actually load that from the file system. Yeah, there we go. So let's run example. Okay, so I'm running example in one terminal. ps aux, grep example. Okay, one of the other nice things we can use to figure things out about a program is the proc file system. So /proc is a way the operating system gives you a view into what's going on on the system, in the form of a file system. There's a crazy amount of stuff in here that you can explore at your leisure. But I know you go to /proc/ and then the process ID, and then we can look in there. This is all the information about that specific process. Cool things are like cwd, the current working directory, which is a link there. exe, which executable is being executed. And, I think this is maps, we can also look at exactly how memory is laid out for this process. So we can see we have the binary, example, and it is being mapped at this memory address. And the way to read this is, these are permissions. An r means you can read the memory. If there's a w, that means you can write the memory. And if there's an x, it means you can execute the memory. So this is how some protections are done. This one would be the text section, so it's readable and executable. That's where our code lives, but we'll get into more details later. I just wanna show it off here. But we can see that there's all this other stuff.
So there's the loader. The loader is actually in our program's address space. And the loader has loaded this libc version that is running. So this is what the loader has to do: when our program calls printf, or calls scanf, because that will happen first, it goes, oh shoot, you need libc. Let me load that from the file system, put it in your memory space, and then make it so that when you call scanf, you're actually going into this area here, into some code section that implements that. This is how libraries are all done. Let's show one example really quickly. Oh, that's not what I want. We don't have to necessarily do it like this. If we use the -static option, what this does is it just shoves the parts of libc we need into our program itself. So the executable gets very large. Let's see, ls -la example. Okay, -h is a nice option to get human readable sizes. So this is 15 kilobytes. If we compile this with -static, almost a megabyte, 979K. And then when we run it, we should be able to, no, 1192. Oh, that's not it, it's this one, 3480. So now we can see a whole lot less junk, right? The only things in here are basically our program and some other places, like the stack is located here, the heap's here, right? That's because everything that was loaded is now, boom, shoved into our binary. Okay, so when it comes to reverse engineering, if we see a call to something like scanf, should we open up libc and start reverse engineering what scanf does? No, what should we do? Yeah, read the documentation, read the manual, right? If somebody else has already documented how this thing should work, let's not dig into it unless we absolutely have to. So that's at the module level, right? We can have different libraries. Then we define functions, and, I don't know if this is a good thing or a bad thing, but most of the reverse engineering tools are targeted towards understanding a specific function.
And so usually you reverse engineer a function in order to understand what it does, because that function does not have a name. So you don't know what that function does. And then, by giving it a name, you can see more of what's going on. Okay, first thing to look at, let's go back here. Okay, so we have our nice program. Let's say, oops, main, int argc, char pointer pointer argv. Okay, argc is an integer, and let's say if it's two, wait, so this is the number of arguments, and the file name always counts as one. So if it's two, then let's say printf, hello world, arg %s, buff %s, okay, otherwise. So if you were writing this in assembly, what's the assembly code gonna look like for this if statement, roughly? Yeah, it compares, then jumps, right? That's what we expect to see. Okay, we can jump in here, so we can look. So compare rbp minus hex 414 with two. Two was the value that we specified in there, that makes sense, that's a good check. Jump if not equal, so if it's not equal, jump to this, at 7a, or rather at 1203, which will be down here, and then it'll start executing there. So, assuming we start executing at main, where will we stop executing main? So control flow is just gonna go, the CPU is just gonna be like, boom, execute this, and then afterwards this, and this, and this, and this, and this. What happens when it gets to a call? It goes somewhere else, right? It goes to 1090, and then when it comes back, it compares and then jumps, and then goes to 1203. So you can think of these like a single unit, like a block of code that will always execute together. So these instructions here, now getting more precise, whether you end the block at that call or not is up to the tools, but we can think of it as maybe from up here down to the jne is one block. And then after this jne, what are the two possible places that the code can then execute? 11d8, the fall through if it's equal, and then if it's not equal, it'll be 1203, right?
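The block-splitting idea can be sketched in a few lines. This is a simplified illustration, not how any particular tool does it: it treats calls as not ending a block (one of the two conventions mentioned above), and the listing is a toy. Only 0x11d8 and 0x1203 come from the walkthrough; every other address is an assumption.

```python
def basic_blocks(instrs):
    """Split a sorted list of (address, mnemonic, jump_target) into basic blocks.

    Returns (blocks, edges): blocks maps a block's start address to its
    instructions, edges maps a block's start address to its successors.
    """
    addrs = [addr for addr, _, _ in instrs]

    # A "leader" starts a new block: the first instruction, any jump target,
    # and the instruction right after any jump.
    leaders = {addrs[0]}
    for i, (_, mnem, target) in enumerate(instrs):
        if mnem.startswith("j"):
            leaders.add(target)
            if i + 1 < len(instrs):
                leaders.add(addrs[i + 1])

    blocks = {}
    current = None
    for addr, mnem, target in instrs:
        if addr in leaders:
            current = addr
            blocks[current] = []
        blocks[current].append((addr, mnem, target))

    edges = {}
    starts = sorted(blocks)
    for idx, start in enumerate(starts):
        _, mnem, target = blocks[start][-1]
        fall_through = [starts[idx + 1]] if idx + 1 < len(starts) else []
        if mnem == "ret":
            edges[start] = []
        elif mnem == "jmp":
            edges[start] = [target]
        elif mnem.startswith("j"):
            edges[start] = [target] + fall_through   # taken, then not taken
        else:
            edges[start] = fall_through
    return blocks, edges


# Toy listing shaped like the main in the lecture (most addresses made up).
LISTING = [
    (0x11A9, "push", None),
    (0x11CA, "call", None),     # call scanf
    (0x11CF, "cmp", None),      # compare argc with 2
    (0x11D6, "jne", 0x1203),
    (0x11D8, "call", None),     # printf for the argc == 2 case
    (0x1201, "jmp", 0x1217),
    (0x1203, "call", None),     # printf for the other case
    (0x1217, "ret", None),
]

blocks, edges = basic_blocks(LISTING)
for start in sorted(blocks):
    print(hex(start), "->", [hex(s) for s in edges[start]])
```

The first block ends at the jne and has two outgoing edges, which is the same shape a graph view draws with its two arrows.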
So we can actually visually represent this as a graph, so we can put in all the basic blocks and then have edges between them, depending on those conditions. objdump will not do that. With objdump you need to think about that and do it yourself, but there are other tools that will do this. I'll show IDA, which is, for better or for worse, the leading industry reverse engineering tool. It's been around for a long, long time and has had a lot of development done on it. I think it should be here for most of you. I don't know why it's not here on mine, but it's definitely on your system somewhere. Mine's under Development, IDA Freeware. The full, fully formed version of IDA costs like $10,000 for a year's subscription. Yes, they have a free version that's I think one or two versions behind, that only works on x86, that you can download and use. And then Yan's worked it out with the developers to give you guys access to this through the dojo. Yeah, pretty nice business if you can do it. Yes, to access it through the dojo, you have to do it in the desktop. All of the other tools I see up there. Yeah, we'll get there. You're getting too fast for me. Okay, so in here I'm gonna choose example. We said question mark, oh there we go. Okay, and sorry, I realize this is kind of small. I don't know how to zoom in better here, but there's like a ton of options. I would highly recommend fiddling with this, because you can literally just throw in a blob that you got from somewhere and it will do its best. You can tell it how to do things. For many things it's very good, and especially for this, it's very good at figuring out what you kind of want to do. So you can just use the default options. It will then go try to figure things out, and you can actually even see in the lower left the graph that it generates. And let's see, you kind of want to wait, because it can take a while, until it says the initial autoanalysis has been finished.
This is your cue that, like, okay, it's done, it's kind of figured everything out. It by default puts you in main. So if we look, this is exactly the same code: endbr64, push rbp, mov rbp, rsp, subtract 420 hex from rsp. And then we had our compare of two with rbp, but let's actually compare that. So we have the jne, jump, jump, jump, jump, jump. Why can't I see it, oh, here it is. The jne and the compare with rbp minus hex 414. If we look back here, we see it's actually not that, it says var_414. So why is it doing that? What it's doing is it's actually realizing, oh, any offset from rbp in this function is a local variable. So it gives that variable a name so that you can rename it. And the magic key in all these tools is F5. If you hit F5, it will decompile. And so we can see main, we can see that it actually has figured out that there is some, now why didn't it call this buff? It says char v4. Yeah, not because it's stupid, it just doesn't know, right? That information literally disappeared from the binary. That data is not there. And then it has this unsigned int v5. So looking at this, we see the scanf, we see our string, we see v4. We see if argc is two, then printf hello world, blah, blah, blah, pass in argv[1] and v4, otherwise printf hello world, that. So, I guess I didn't strip this one, but there are a lot of keyboard shortcuts I would highly recommend you use. There's also the right click menu, but you can rename the variable to buffer. And now that propagates everywhere. So now you can see exactly what that was. It's actually a bad example because it got everything so correct. The other cool trick to use is tab. So if you're on the assembly or the decompiled view and you hit tab, it switches you over to the other one and highlights whatever was highlighted there. So if you want to know, hey, where's this printf happening?
You hit tab and it'll point you here. So this is exactly the whole point of showing this as a graph. So we have this block, then this jnz, and then there's a red arrow and a green arrow depending on whether that's false or true. And while decompiling, it turned that into an if/else statement. Kind of cool, right? Cool, okay. But functions do a lot of things, as we saw. So actually let me show you the two other tools real quick, just because that'll help you as you go about these things. The tool that I'll show next is the one that we actually make here at UCSB in our lab. So angr is a binary analysis platform and we've built a decompiler into angr. This is all open source, you can run this locally. I will say the thing I added was this color theme that I'm very happy with. I think it's the Dracula theme. Yeah, it's great. But, so it's all written in Python. You can actually go look at all the code for all of this stuff. And we're working, as part of our research, to make this better. But we can do that. Let's, I hope it doesn't barf on this, but. So we get a similar window for what this is doing. Again, run analysis. So it's the exact same view here of the assembly code, the jne to this, and the functions are all here that we can look at. If we do F5, we should see the decompiled view. So we see, oh, that's bad. Why didn't it figure out what these were? Anyways, okay. So it is, ooh, it didn't determine that that's a buffer. So this is where tools get things wrong. So here it just says there's a character, but it's passing it into scanf, and it's passing in the address of that, v0. It's saying if a0 is two, and I can use n to rename this to argc. If I actually know that, I can rename this one to be argv, but that's not used. So can I not do that? Well, sometimes there's bugs. Those things happen. So if argc is two, then print out, that's weird, a1. So see, it's messing up on what this is. It thinks it's a structure.
Okay, so one thing you can do in both IDA and angr management is change the types of things. Because, again, just like variable names disappear, the types of things disappear. So if I tell this that it's a char pointer pointer, it will not believe me. All right, I'll have bug reports for people after this class. Awesome. Okay, char v0, was it hex 1024? Something like that? It should redraw too, but it's not redrawing. Oh, and the type of this function may be wrong. Anyways, okay, so let's not mess with this anymore, even though I want to fix all these problems. The point being, we can interact with a tool, we can change the name. Great, redraw that. The other weird thing is it didn't do the if/else. It has two returns here. I don't know if that's because that's what the binary does. No, they both go to the same place. Anyways, okay, so at least maybe the different, I wonder if that's gonna work. This is a research project, so. Nope, great, and all my changes disappear. That's exactly what I wanted. All right, so probably don't use angr management. As you can see, it's very much a work in progress, although it looks way better than IDA, I think we can all agree. But the other actual widely used tool is, I don't want to save. Just get me out of here, quick. No, okay. The other tool is here on my toolbar, Ghidra. So Ghidra was created by the NSA. This is their in-house reverse engineering tool, and they released it, it was like two or three years ago, for free online, and it is open source, so you can actually go look at it. It's written in Java, which is incredibly annoying, but obviously I did not make that call. So you can run Ghidra locally. Some people love Ghidra and use it a lot, and if that's you, I think that's awesome. I think Ghidra is great. I've just used IDA for so long that all the keyboard shortcuts and everything are burned into my brain, so it's really hard for me to use other tools.
So if I open, oh no, not the project. So I first need to open a new project, non-shared project, project name, test class. So you have to first create a project, and then you have to import a file. Okay, example is finally going in there, and you get the cool little dragon stuff. A summary, which of course goes off the screen, which is super handy. I'm sure there's an OK button at the bottom. Then double click it, and you will be getting your Ghidra on. The big difference, I think, is in what the default is. So it says it hasn't been analyzed, would I like to analyze it. Yeah, for sure, analyze it, that's why I'm here. I don't want to look at this myself. So one thing is the decompiled view is to the right of the assembly view, so rather than being a separate view by default, it shows you both as you go through. So as I click into main, I see the Ghidra decompiled code here. So I see main, int param_1, param_2, and I'm seeing the scanf with local_418, so this actually determined the size of the buffer correctly, and it said if param_1 is two, do this, otherwise do this, and then return zero, so it looks like it got slightly better output. So it's up to you which of the different tools you use. Use what speaks to you internally, what vibes with what you want, because you can do literally everything we talked about in each of them. It's just that the keyboard shortcuts are different, and that's what really annoys me. So you can rename this variable to argc, you can rename this variable to argv, you can then retype the variable and call this a char pointer pointer, and that made that look prettier, you see that? So it got rid of the typecast, and we can call you buff, and we can also define the type of buff as char 1024. Good, cool, there's our buff. So now it gets passed in correctly, awesome. And then the other thing we could do as we're looking at a function, of course, like we talked about, would be that if this didn't have a name, we could give it a name, and that helps us. Okay, but what are these functions doing?
So every time a function is called, it needs to be prepared, and so we looked at even what registers functions can use. So if you're a function, what registers can you use? Or how would I look up what registers I could use? The System V calling convention. I could look that up and I could say, oh, look, parameters are passed in rdi, rsi, rdx, rcx, r8, r9, so when we're calling a function, we can see which ones are the arguments, because they'll get passed in there, just like you did with the syscalls. When we're reading a function, we can see what registers it's reading from. Functions preserve the registers rbx, rbp, rsp, and r12 through r15. So that means that if we as a function want to use any of these, we need to make sure we restore them before we jump back to whoever called us. But we're free to use rax, rdi, rsi, rdx, rcx, r8, r9, r10, r11, except we need to know that they're scratch registers, and their values can go away whenever we call another function. Cool, okay. And so this is exactly what every function prologue and epilogue is doing. So let's go back to, I guess let's stay on Ghidra. Okay, let's do this because I, oh no, okay, that's fine. Okay, so the prologue of that function, what's first in the prologue? This endbr64 is the very first thing, but it doesn't do anything right now. So the first thing we do is push rbp. So what's that gonna do? It pushes onto the stack, not the stack pointer, but rbp. So as we'll see, that's the base pointer. Yeah, it is a register that we're going to use, so we need to save it. So we push rbp, we then move the stack pointer into rbp. So wherever the current top of the stack is, is now where our base pointer is. And then we subtract 420 hex from the base pointer. Sorry, from the stack pointer. And then we look at the epilogue. So that's the thing at the end of the function. So the very end of the function will be ret.
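As a quick reference for the register roles just described, here is a small sketch that assigns integer or pointer arguments to registers in System V AMD64 order. The helper name `place_arguments` and the argument labels are made up for illustration; floating-point arguments, which use different registers, are deliberately left out.

```python
# Integer/pointer argument registers, in order (System V AMD64).
INT_ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]

# Registers a function must hand back unchanged to its caller.
CALLEE_SAVED = {"rbx", "rbp", "rsp", "r12", "r13", "r14", "r15"}

# Scratch registers: may be clobbered by any function you call.
CALLER_SAVED = {"rax", "rdi", "rsi", "rdx", "rcx", "r8", "r9", "r10", "r11"}


def place_arguments(args):
    """Map each argument to the register (or stack slot) it is passed in."""
    placement = {}
    for i, arg in enumerate(args):
        if i < len(INT_ARG_REGS):
            placement[INT_ARG_REGS[i]] = arg
        else:
            # The seventh argument onward is passed on the stack.
            placement[f"stack[{i - len(INT_ARG_REGS)}]"] = arg
    return placement


# A printf-like call with a format string and two string arguments.
print(place_arguments(["fmt", "argv[1]", "buff"]))
# -> {'rdi': 'fmt', 'rsi': 'argv[1]', 'rdx': 'buff'}
```

When reading disassembly, this is the lookup you do in reverse: whatever was loaded into rdi, rsi, rdx right before a call is the first, second, and third argument.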
The other function that's, the other address that's special, sorry, the other instruction that is very complicated is the leave instruction. This does the exact opposite. It pops off. It, oh, we got this right. It pops from rsp into the, nope, okay. Let's go over the slides. So we'll go through an example of this and then I will remember. Okay, but in this example in the slides, we're first pushing rbp, moving rsp into rbp, and subtracting a thousand hex from rsp. And then what we're doing at the end is moving rbp into rsp and then popping rbp. Yeah, there we go. Okay. So we need that place, and we've actually already been using a stack, right? It's just been kind of implicit, whatever was in rsp. And so, the different sections: we've used the data section, which is pre-initialized global writable data. If you have string constants, they're typically in rodata, so read-only data, data that shouldn't be modified. And BSS is used for uninitialized global writable data, usually global arrays without initial values, so you can write to those. But local variables, like we saw, live on the stack, and they need to be stored for every invocation of the function. So, I believe we talked about this, but the stack grows from higher addresses to lower addresses. I personally am a big fan of the drawing on the far right, where the stack grows down, but fundamentally it doesn't matter. It just has to grow from higher addresses to lower addresses. So when you push to the stack, the number in rsp is going down. If you're popping from the stack, it increases. So, in any of these three examples, the yellow values are the ones that are, so in what order were those values pushed onto the stack? Which one? I guess more specifically, yeah, let's see. So this is the top of the stack, and it's growing negatively, so we had to push 41 and it went here, then we pushed 42 here, and we pushed 43 up here.
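The push order above can be checked with a tiny simulation of a downward-growing stack. This is a sketch under simple assumptions: memory is a dictionary, every slot is 8 bytes, and the starting value of rsp is made up.

```python
class Stack:
    """A toy stack that grows toward lower addresses, 8 bytes per slot."""

    def __init__(self, rsp=0x7FFFFFFFE000):   # starting address is an assumption
        self.rsp = rsp
        self.mem = {}

    def push(self, value):
        self.rsp -= 8                 # pushing makes rsp smaller
        self.mem[self.rsp] = value

    def pop(self):
        value = self.mem[self.rsp]    # read whatever rsp points at
        self.rsp += 8                 # popping makes rsp larger
        return value


stack = Stack()
start = stack.rsp
for value in (41, 42, 43):
    stack.push(value)
    print(f"pushed {value}, rsp = {stack.rsp:#x}")

lowest = stack.rsp
popped = [stack.pop(), stack.pop(), stack.pop()]
print("popped in order:", popped)
```

The last value pushed sits at the lowest address, so it is the first one to come back off.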
I guess another way to think about it is, if you pop, which one comes off? Yeah, wait what? 43? No, 43. Yeah, so the way to think about it is that the place where the stack currently is, is at the negative-most, the lowest address. That's why I like it. Wait, are these wrong? No, no, they're right, yeah. So if we pop from the stack, we'd get 43. If we popped again, we'd get 42. If we popped a third time, we'd get 41. That's going in reverse order. To get back to there, we'd push 41, 42, 43. Oh, literally it's on the slide. I didn't see that, sorry. Huh? Yeah, yes, none of us saw it. It's good, you're listening to me and not reading the slides. Especially when it has the answer to what I just asked. Okay, so your stack, so the start of the stack, the upper addresses, are here, and the current rsp points to here. And when I say rsp points to, I mean the rsp register holds this exact memory address. So if we look at main, we've actually done several different things here for main in our program. So we had int argc, which tells you how many arguments are passed in. We have a pointer pointer to all of those string arguments. There's actually a third argument that gets passed in, the environment. And if you run the command env, this shows you your environment. And usually when you execute a program, or when you fork a process, it inherits the environment of its parent. So you can control this. But anyways, for now, what we're showing in this example is that we have the stack pointer pointing here. Ooh, I bet, can I figure out how to do this? I want to point. Anybody see any dot, dot, dots? Show my pointer. That's not very useful. It's not what I wanted. All right, so initially when our program's called, so this is our program, is argc on the stack? Is it? Historically, even though it's in a register for main? Yeah, exactly. Okay, the stack is here. In our registers, the calling convention was what?
What's the first one? Was it rdi and then rsi? So rdi, rsi. rdi will have argc, the number of arguments to our program. rsi will then have the address of this first place on the stack. And then argv[0] will point to here, the actual string in memory that contains the bytes of the program name. In this case, it's ./program. And argv[1] is a pointer to here, argv[2] is a pointer to here, and these are all on the stack. So the operating system sets this all up, so our program does not have to worry about it. And that's actually what main means: it is a contract between your program and the operating system. So the operating system tells you how many arguments you were called with, what those arguments are, and how to access the environment. The next one is envp, so your environment argument, the third argument: argc, argv, environment. This is an array of pointers, each of which points to one of the values. Anyways, this is what the stack could look like. And now, when we want to call a function, so one of the key important things, and as we'll see this is what enables classic forms of memory corruption, is altering memory on the stack that holds these return addresses. When you call a function, a call is essentially a jump to a specific address plus a push of the next instruction to be executed onto the stack. So at this point here, we have a call to main. What happens is that we start executing main. So the program counter, or the instruction pointer, sorry, specifically the rip register, is set to this value. And what happens implicitly is that this memory address, which is 401055, gets pushed onto the stack. So this pushes 401055 onto the stack, and a return instruction at the end of a function, a ret, implicitly pops that, so it is a pop rip. So a ret literally does nothing but pop rip. So whatever the stack currently points to, just start executing from there.
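To see how little a call and a ret actually do, here is a toy model of the two instructions. It is only a sketch: memory is a dictionary, the address of main and the starting rsp are assumptions, and 0x401055 is the return address as read from the transcript. The second run overwrites the saved return address before the ret, which is the kind of tampering the lecture alludes to.

```python
class Cpu:
    """A toy CPU with just an instruction pointer, a stack pointer, and memory."""

    def __init__(self, rip, rsp=0x7FFFFFFFE000):   # starting rsp is an assumption
        self.rip = rip
        self.rsp = rsp
        self.mem = {}

    def push(self, value):
        self.rsp -= 8
        self.mem[self.rsp] = value

    def pop(self):
        value = self.mem[self.rsp]
        self.rsp += 8
        return value

    def call(self, target, next_instruction):
        # call = push the address of the next instruction, then jump.
        self.push(next_instruction)
        self.rip = target

    def ret(self):
        # ret = pop rip: resume at whatever value the stack currently holds.
        self.rip = self.pop()


MAIN = 0x401100          # hypothetical address of main
RETURN_ADDR = 0x401055   # address of the instruction after the call


def run(tamper_with=None):
    cpu = Cpu(rip=0x401050)
    cpu.call(MAIN, RETURN_ADDR)
    if tamper_with is not None:
        cpu.mem[cpu.rsp] = tamper_with   # someone overwrote the saved address
    cpu.ret()
    return cpu.rip


print(hex(run()))             # returns to the instruction after the call
print(hex(run(0xDEADBEEF)))   # the CPU follows the overwritten address
```

Nothing in `ret` checks where the value came from, which is exactly why the saved return address is such an attractive target.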
You can think of it like, I mean, you know the story of Hansel and Gretel? So there's like a fairy tale. You can go read up on it after this or go look it up online. I'm sure there's a Wikipedia article. The idea is these children go exploring in the woods. And they think, aha, I know, I will figure out how to get back by taking a loaf of bread with me and leaving a trail of pieces of bread as they go exploring in the forest. Of course, the problem is anything can happen. And as you go backwards, you may lose the trail because animals have eaten the bread. If you had a smart adversary like yourselves, they could actually cleverly remake that trail to make the children, I don't know, go into an oven or whatever happens at the end. I'm mixing up the story, I think that is a part of the story, but it doesn't happen like that. But anyways, the point is the breadcrumbs. So these addresses are stored on the stack, which is just memory, and if anybody changes them, your program does not know that. It is not smart. Remember, your CPU is incredibly dumb. Whenever it sees a ret, it just pops rip. Okay, then in the prologue, every function needs to set up its stack frame. So it needs to reserve space on the stack for it to execute. That is why our program was subtracting 420 hex, so that it had enough bytes for its buffer on the stack. So rsp always points to where the stack currently is, and the base pointer, rbp, points into the stack frame of the function. So specifically, okay, we're towards the end. So specifically, the prologue does a push rbp. So the first thing that happens when a function is called is that implicitly the return address is stored onto the stack. The next thing that's stored onto the stack is the saved base pointer. So that was that push rbp that we saw in every function. If it uses a base pointer and needs a frame, it will store it.
Then it moves the stack pointer into the base pointer, so that the base pointer is currently pointing here in this case, and then subtracts, in this case, 100 hex, which then means in this example, oh yeah, I got it. Okay, perfect. Yeah, turn on the pen. Oh my God. Okay, so at this point this would be rsp and this would be rbp for our function. Right, so we essentially, quote, allocated memory. And again, here we don't have to ask the operating system for memory. All we need to do is subtract from rsp in order to, quote, allocate memory on the stack. Now, when we want to return, we need to reverse that process. So we had rsp here, we had rbp here. Am I correct? Oh, it saves it. This is so cool. This is gonna revolutionize teaching. Okay, the first thing we do is move the base pointer back into the stack pointer. So move rbp into rsp, and this just changes rsp from previously pointing, oh please tell me I can erase. I cannot erase. Okay, so rsp will no longer point here, it will now point here, and then it does a pop rbp, which then moves whatever was on the stack into rbp. And so now rsp points here and rbp will point somewhere else, into the frame of whoever called us, because we stored that as soon as we were called. Now when we hit the ret, we return and start executing back up here. All right, anyways, we'll stop here. The module will be out later today. We'll continue this. It'll actually start with debugging, so you'll learn GDB stuff, which is super fun. And then there'll be reverse engineering challenges in there. So it is on the stack, hard to see. It is? I wonder if that's backwards.
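The prologue and epilogue walked through in the lecture can be replayed step by step in a small model. This is a sketch, not a real emulator: registers are plain attributes, memory is a dictionary, the frame size is the 100 hex from the slide example, and every address other than the 0x401055 return address is made up.

```python
class Cpu:
    """A toy CPU tracking only rip, rsp, rbp, and stack memory."""

    def __init__(self, rip, rsp, rbp):
        self.rip, self.rsp, self.rbp = rip, rsp, rbp
        self.mem = {}

    def push(self, value):
        self.rsp -= 8
        self.mem[self.rsp] = value

    def pop(self):
        value = self.mem[self.rsp]
        self.rsp += 8
        return value


CALLER_RSP = 0x7FFFFFFFE000   # hypothetical stack pointer in the caller
CALLER_RBP = 0x7FFFFFFFE040   # hypothetical base pointer of the caller
RETURN_ADDR = 0x401055        # instruction after the call
FUNC_ADDR = 0x401100          # hypothetical address of the called function
FRAME_SIZE = 0x100            # space reserved for locals

cpu = Cpu(rip=FUNC_ADDR, rsp=CALLER_RSP, rbp=CALLER_RBP)

# The call instruction (executed by the caller) pushes the return address.
cpu.push(RETURN_ADDR)

# Prologue: push rbp; mov rbp, rsp; sub rsp, 0x100
cpu.push(cpu.rbp)
cpu.rbp = cpu.rsp
cpu.rsp -= FRAME_SIZE

# Inside the function: remember what the frame looks like.
frame_rsp, frame_rbp = cpu.rsp, cpu.rbp
saved_rbp = cpu.mem[cpu.rbp]        # caller's rbp sits where rbp points
saved_ret = cpu.mem[cpu.rbp + 8]    # return address sits just above it

# Epilogue: mov rsp, rbp; pop rbp; ret
cpu.rsp = cpu.rbp
cpu.rbp = cpu.pop()
cpu.rip = cpu.pop()

print(f"frame: rsp={frame_rsp:#x} rbp={frame_rbp:#x}")
print(f"after return: rip={cpu.rip:#x} rsp={cpu.rsp:#x} rbp={cpu.rbp:#x}")
```

After the epilogue, rsp and rbp hold exactly what the caller had before the call, and the only "allocation" that ever happened was a subtraction from rsp.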