Welcome. This talk is called Descent into Darkness: understanding your system's binary interface is the only way out. We're going to talk about what that means shortly. So who am I? I'm Joe Damato. I live in San Francisco, I used to work at VMware, I went to CMU. Some stuff I work on: I wrote the first version of memprof, which you saw the web demo of yesterday; I've written some ltrace patches, some MRI threading improvement patches, things like that. I maintain a blog, timetobleed.com. That's the actual URL, if you don't believe me. And on Twitter I'm just @joedamato. Alright, enough of that. We've only got 30 minutes, so we probably won't have time to get through everything, but I'm going to try. Welcome to flight school; we've got to roll. I have no clue why this talk was actually accepted, because there are only about five lines of Ruby code in it. But before we get started, I need to introduce you to a really, really good friend of mine: Satan. The reason you need to meet my good buddy Satan is that this talk is basically about how being evil is totally awesome. In other words, you shouldn't ever do any of the stuff I'm talking about in this talk unless you do it in a VM, because you could seriously destroy your system. But if you get it working right, amazing things are possible, which is what we're going to show. Okay, so what's the problem? Basically: my Ruby process is 700 megabytes and I want to know why. I don't know if you saw it yesterday, but when someone said object allocations were free, I was having a seizure in the corner over there. Alright, so there's Ruby. It's large and slow, and we want it to be small and fast, like that guy. So what's the problem? The problem is that it's really easy to leak references in your Ruby code.
If you leak a reference to an object, that object, and any object it references, sticks around in memory forever. Let me show you a picture of what that means. As long as somebody somewhere is holding a reference to the object at the top there in gray, this entire tree of objects cannot be freed, and that can add up to a lot of memory really fast. But just as important: on every GC cycle, the GC is going to scan this tree of objects, burning CPU cycles asking, hey, are these free yet? No. Are these free yet? No. Over and over and over again. That can add up to quite a bit of CPU burned every GC cycle. You might say, hey, memory is cheap, CPUs are cheap, I don't really care, I'm not scared. But Ruby's GC is a naive stop-the-world mark-and-sweep GC, which means that the more objects stick around in memory that you don't need, the longer your GC runs are going to take. And the longer GC runs take, the less time your app has to run Ruby code, which is obviously bad. So if we eliminate leaked references, we reduce the length of GC runs, we run more Ruby app code, and everybody's happy. We want to do that, but how? How can we track down these reference leaks? So, the problem requirements. Anybody who knows me knows that I'm really lazy and I don't want to do any work I don't have to do, or that I can make somebody else do for me. I don't want to apply patches to Ruby, and I don't want to rebuild Ruby, because that's hard. I don't want to break binary compatibility with my extensions. I want to do this once and have it just work. I really want it to just be a gem: require a Ruby gem, do memory profiling, done. Anything more than that is way too much work. Luckily, my boy Satan has my back.
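To make the GC cost argument concrete, here's a toy Ruby sketch (my own illustration, not MRI's actual collector): a naive mark phase has to revisit every reachable object on every cycle, so one leaked root adds its whole subtree to the cost of every GC run. `Node`, `build_tree`, and `mark` are made-up names for the sketch.

```ruby
# Toy model of a naive stop-the-world mark phase (not MRI's real GC):
# every cycle, marking walks everything reachable from the roots.
Node = Struct.new(:children)

def build_tree(depth, fanout)
  return Node.new([]) if depth.zero?
  Node.new(Array.new(fanout) { build_tree(depth - 1, fanout) })
end

# Returns the number of objects one mark pass has to visit.
def mark(node)
  1 + node.children.sum { |c| mark(c) }
end

roots = []
roots << build_tree(2, 2)   # live data: 7 objects
cost_before = roots.sum { |r| mark(r) }

roots << build_tree(3, 3)   # a leaked reference: 40 more objects
cost_after = roots.sum { |r| mark(r) }
# Every GC cycle now pays to walk the leaked subtree, forever.
```

With the leaked root retained, every simulated cycle walks 40 extra objects that will never be freed, which is exactly the "burning CPU every GC run" problem.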
We can do some really, really nasty things to get this going. I'm going to skip these two slides real quick, because I don't think we'll have time to get through everything otherwise. All you really need to know is that AMD64 is a processor spec: AMD invented it, but Intel implemented it as well and named their implementation Intel 64. So wherever you see AMD64 here, it applies to Intel CPUs too; it's just the name of the spec. I'll skip this one too; it basically just says they both implement it. Anyway, what's an ABI, an application binary interface? According to Wikipedia, it describes the low-level interface between a program and the operating system, or another application. That doesn't mean much by itself, but what's contained inside it are things like alignment (it specifies the alignment of data types), calling conventions, object file and library formats, and sometimes syscalls: how they live, how they work. So where do you actually find these ABIs? There's the System V ABI, which is the big one above all of them, at 271 pages, and underneath it there are architecture-specific supplements. The awesome thing about the AMD64 supplement is that it says: yo, include everything in the i386 supplement, except with these changes. So if you want to do any low-level work on 64-bit x86, you have to read all three of those documents. And there are other supplements for other architectures, too. Luckily, I brought copies of all three, so we could just chill and read them for the rest of the day. No, but seriously, let's blaze through some of this stuff right now. We need some evil devices to get this going. There's nm, which comes by default on OS X and Linux. It just dumps a symbol table.
objdump will disassemble lots of different object formats: ELF, Mach-O, lots of different things. readelf is Linux-specific, and it dumps ELF-specific information about a binary, which we'll talk about very shortly. And then there's dwarfdump, which dumps DWARF debugging information. So, sample usage of nm: here I'm running nm on the Ruby binary. On the left, you see the symbol value; "value" in quotes, because depending on the object it could be an absolute address or just an offset. On the right, you have a bunch of symbol names, and in the middle there's the symbol type, which you can read about in the man page. objdump next. Sorry I'm going so fast; I just have to get through this initial boilerplate stuff. So, objdump -D says: yo, disassemble all of Ruby and give me some data. On the left, you have the offsets. To the right of that, the opcodes: the actual machine code, the actual bytes the CPU is reading and executing. Next to that are the instructions: those same bytes translated into assembly. And on the far right is helpful metadata that libopcodes outputs, telling you, hey, this line refers to a global called during_gc, this one refers to malloc_limit, just to make it easier to understand what's going on. You also have readelf, which dumps ELF information; there's a lot more output available than just this. And dwarfdump, same idea. Okay, cool. So we need to meet some new friends before we can really get down and dirty. Registers: they're important. They're small, fast pieces of memory that live on the CPU, and some registers have specific jobs. One example of a register with a job is RAX.
It holds the return value from a function. There's also RIP, the instruction pointer, which keeps track of which instruction you're executing. You can also refer to registers in pieces, so let's take a look at what that means. Here's RAX, uncensored. On the bottom, that's the full 64-bit RAX register. It takes up 8 bytes, which in Intel-speak is one quadword. There's also a register called EAX, which is just the lower 32 bits of RAX. It refers to the same piece of memory, just the lower 32 bits, and it's 4 bytes wide, one doubleword. The lower 16 bits of that are called AX, and AX is split into two pieces: the upper 8 bits, AH, and the lower 8, AL. Looking at this diagram is sort of like digging up remains: 16-bit CPUs, then 32-bit CPUs, then 64-bit CPUs; they just kept gluing stuff onto the side. Some other simple notes we need. There are two different syntaxes for assembly, just to make your life that much more painful: AT&T (gas) syntax and Intel syntax. By default, all the GNU tools display AT&T syntax, but you can switch them to Intel with those commands if you want. I personally like gas; it's just what I learned on. I think Intel syntax is hard to read, but whatever floats your boat. Unless otherwise noted, all the assembly in the slides coming up is AT&T/gas syntax. And yes, there is a lot of assembly coming up. So, moving stuff. Everybody else took a poll, and I'd feel left out if I didn't take one: show of hands if you thought you were going to learn assembly today. Awesome. Alright, so, moving stuff. How does this work? This is AT&T syntax only, keep that in mind: it's source, comma, destination. You can move immediate values, like that second line right there, where I'm moving zero into the register RBX. Or I can move a register to a register, like moving EAX into RAX.
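Here's the RAX Uncensored diagram redone as plain arithmetic, a Ruby sketch using bit masks (the constant names are mine):

```ruby
# One 64-bit value, viewed through x86-64's overlapping register names.
RAX = 0x1122334455667788    # full 64-bit register: 8 bytes, one quadword

EAX = RAX & 0xFFFF_FFFF     # lower 32 bits: 4 bytes, one doubleword
AX  = RAX & 0xFFFF          # lower 16 bits: one word
AH  = (AX >> 8) & 0xFF      # upper byte of AX
AL  = AX & 0xFF             # lower byte of AX
```

So EAX here comes out to 0x55667788, AX to 0x7788, AH to 0x77, and AL to 0x88: the same memory, sliced up the way the 16-, 32-, and 64-bit generations of the CPU left it.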
If you remember from the last slide, EAX is the lower 32 bits, so I'm basically truncating whatever is in RAX to be only 32 bits wide. Important note: source and destination cannot both be memory. Calling functions. There are lots and lots of ways to call functions on a 64-bit CPU, but only two we really care about for this talk. There's the indirect absolute way, where you can think of RBX as a pointer whose value is the absolute address of a function; that call quite literally dereferences RBX and executes the code found there. The next instruction just shows an absolute address, 0xdeadbeef, but that calculation is actually done relative to the next instruction pointer, with a 32-bit displacement. I know that makes absolutely no sense; I'm going to show a diagram of it shortly. So, calling conventions: how do you call functions in assembly? Arguments live in registers, from left to right: the first argument in RDI, the second in RSI, and so on. That's for INTEGER-class items; there are lots of different classes. And stuff gets passed on the stack too, just like on IA-32. The important thing, and this is going to come back to bite us in the ass later, is that the end of the argument area must be aligned on a 16-byte boundary to comply with the ABI. There are also these big contracts about whether you're supposed to save registers yourself or whether the function you call saves them: if you're the callee, you have to save certain things; if you're the caller, you save other things. Each side has a specific job. Alright, now I'm going to very quickly walk through some assembly code and the matching C code that goes along with it. On one side is Intel syntax, on the other side is AT&T, and on the bottom is C code.
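The C function in the walkthrough that follows isn't reproduced in this transcript, but from the description it's roughly the following (a reconstruction; the name `add_150` is my guess), shown here as a runnable Ruby mirror with the C in a comment:

```ruby
# Reconstructed C from the walkthrough description:
#
#   int add_150(int amount) {   /* amount arrives in EDI (low half of RDI) */
#     int ret = 0;              /* a stack slot gets set to zero           */
#     ret = amount + 150;       /* mov to EAX, then add $150 to EAX        */
#     return ret;               /* return value lives in EAX/RAX           */
#   }
def add_150(amount)
  ret = 0
  ret = amount + 150
  ret
end
```

Calling `add_150(10)` gives 160, matching the `add $150` in the disassembly.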
At the top, these two instructions save the old stack frame base pointer and set the base pointer to the current stack pointer. They're basically carving out a slot and saying: yo, we're going to start referencing things from here. The next two instructions do the initial setup. That mov of EDI you see at the top, that first instruction: if you remember from the slide before, RDI holds the first argument to a function, and as we looked at in RAX Uncensored, you can refer to pieces of registers. We're passing an amount down here in the C function, and amount is only 4 bytes wide, so GCC outputs an instruction that says: hey, it's only 4 bytes wide, so just move the lower 4 bytes, EDI, into this spot on the stack, RBP minus 14. Then it sets another spot on the stack equal to 0, which matches up with that ret = 0 line right there. Next, we move what we just stored at RBP minus 14, those lower 4 bytes, which is amount, back out: the first instruction up top matches the first line right there, moving amount into the register EAX. The second instruction is just an add: take whatever's in EAX, add 150 to it, and the result lives in EAX. That matches up with that line of C code. The next two instructions are actually totally useless; GCC generates them, and they'd be removed at higher optimization levels. Then this returns from the function, the result lives in RAX, as the ABI requires, and everybody's happy. Cool. Okay, so, ELF objects. What are ELF objects? What's the deal? They can be executables, they can be libraries. This is what they look like diagrammed. Each shared library you load into the process has its own set of this data that goes along with it.
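As a tiny taste of what wandering through an ELF object looks like, here's a toy Ruby parser for the first bytes of an ELF header, the e_ident array (my own sketch over hand-built bytes, not memprof's actual libelf code):

```ruby
# The first 16 bytes of every ELF object (e_ident) identify it:
# magic 0x7F 'E' 'L' 'F', then class (1 = 32-bit, 2 = 64-bit),
# then data encoding (1 = little-endian, 2 = big-endian).
def parse_elf_ident(bytes)
  raise "not an ELF object" unless bytes[0, 4] == "\x7FELF".b
  {
    klass:  bytes.getbyte(4) == 2 ? "ELF64" : "ELF32",
    endian: bytes.getbyte(5) == 1 ? "little" : "big"
  }
end

# What the start of an x86-64 Linux Ruby binary would look like:
sample = "\x7FELF\x02\x01\x01".b + "\x00" * 9
```

`parse_elf_ident(sample)` comes back as ELF64, little-endian; `readelf -h` prints these same fields for real binaries.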
Each set is independent, and they all work together via the runtime dynamic linker. They have headers; headers help describe the object. There are different types of headers, describing segments or sections; segments are collections of sections. memprof actually uses libelf to wander through an ELF object and extract useful information. But like I just said, the executable and each shared object has its own set of this data. The sections that matter for the memprof adventure we're on right now are: the text segment, which is where code lives; the PLT, which we'll talk about soon, which helps resolve absolute function addresses, because shared objects can be mapped anywhere in your address space, so you don't know until runtime the actual address a function will be at, and you need this PLT thing to figure out where it lives; and the .got.plt, another table the PLT needs for storage. Okay, so the PLT: what's the deal with the PLT? It's the procedure linkage table. It's used to find functions in shared libraries at runtime, because shared libraries are position-independent and can be mapped anywhere in the address space. By now you should be thinking: what does any of this have to do with Ruby? Well, a lot, actually. These are the ingredients we need to do serious, serious evil. We know the AMD64 ABI, or we can at least look up whatever we need to know, and we know how ELF objects work. And, I just don't have time to go over it, so take my word for it: in the Ruby VM, when you allocate an object, with String.new or whatever, the VM calls rb_newobj, and when it frees one, it calls a function, let's just say it's called add_freelist. This is pretty much what this guy Brian Mitchell said to me a long time ago when I proposed this idea.
He said: you won't combine all the stuff you just talked about and rewrite the Ruby VM in memory while it's executing code. You should be scared, like her. Sorry; I'm going to rewrite Ruby while it's running. I apologize for destroying your VM. So, we want to track allocations and frees, and we want to do it as a Ruby gem, so what's the deal? We know the Ruby VM calls rb_newobj to allocate a new object; if we can see that happen, we can track objects. And we know what call instructions look like, since we just talked about calling functions. So why don't we just scan the Ruby binary in memory and rewrite all the function calls to rb_newobj to call a handler of ours instead? I'm not that scared, so let's just do it. You disassemble Ruby, and this is what you get: some function somewhere calling rb_newobj. That part on the left in blue is the address of this instruction. That byte right there is the call opcode, and the next four bytes are the 4-byte displacement. It's trying to call rb_newobj, so over there in orange you see rb_newobj, and the hex string to its left is the absolute address of rb_newobj, where rb_newobj lives in memory. The disassembler calculates that by adding those two things together; that's how relative call instructions work. So let's just overwrite the part in red, the displacement, so that all calls to rb_newobj actually call a different function instead. And the other function we call instead might look something like this: call rb_newobj ourselves, do something to track this object, and then return the object. If we don't actually call rb_newobj and return the object, all hell breaks loose, because the VM was expecting us to return memory for an object. We need to do what rb_newobj was doing, plus whatever else we want.
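The displacement arithmetic just described can be sketched in a few lines of Ruby (my own illustration of the x86-64 rel32 CALL encoding; the addresses below are made up):

```ruby
# A relative CALL is the opcode 0xE8 followed by a signed 32-bit
# displacement; the CPU computes:
#   target = address_of_next_instruction + displacement
# Retargeting the call just means recomputing those four bytes.
CALL_SIZE = 5  # 1 opcode byte + 4 displacement bytes

def encode_call(insn_addr, target_addr)
  disp = target_addr - (insn_addr + CALL_SIZE)
  [0xE8, *[disp].pack("l<").bytes]            # little-endian int32
end

def decode_call_target(insn_addr, insn_bytes)
  raise "not a CALL" unless insn_bytes[0] == 0xE8
  disp = insn_bytes[1, 4].pack("C4").unpack1("l<")
  insn_addr + CALL_SIZE + disp
end
```

So hooking a call to rb_newobj amounts to computing handler_address minus (call_site plus 5) and writing those four displacement bytes over the part shown in red.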
This doesn't work for everybody, though. It only works for rubies built with --disable-shared, which don't have a libruby.so. If you build Ruby with --enable-shared, which is how most Debian, Ubuntu, et cetera boxes build it, there is a libruby.so, and code inside libruby.so calls rb_newobj via the PLT, so we need to do something else to handle that case. So let's look at how the PLT works, really quick. In a Ruby binary built with --enable-shared, you'll have this instruction: over there in blue on the right, you see it's calling rb_newobj@plt. If we go and disassemble what lives at that address (wow, these colors are really bad, I'm sorry, guys; these slides are on timetobleed if you want to follow along on your iPhone or whatever), we see three instructions: a jump, a push, and a jump. So what's actually going on here? Over on the right, that's just metadata saying, hey, this jump refers to this absolute address, and that address is an entry in the .got.plt. Initially, when a shared library is loaded into memory, it has no idea where the function you're trying to call actually lives. So at first, the entry in the table just points back at the instruction following the jump, which seems like it makes no sense, but it will actually make sense very, very shortly.
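That lazy-binding dance can be modeled in miniature, with a Hash of lambdas standing in for the .got.plt (a toy sketch, not real linker behavior; `ToyGOT` and `LIBRARY` are invented names):

```ruby
# Toy model of PLT lazy binding. The "GOT" is a Hash of lambdas; an
# unresolved slot falls through to the "runtime dynamic linker", which
# fills the slot in, so later calls hit the cached entry directly.
LIBRARY = { "rb_newobj" => -> { "a fresh object" } }

class ToyGOT
  attr_reader :resolutions

  def initialize
    @slots = Hash.new { |_slots, name| resolve(name) }  # all slots start unresolved
    @resolutions = 0
  end

  def call(name)
    @slots[name].call
  end

  # Overwrite a slot directly, bypassing the linker entirely: this is
  # the move memprof makes to redirect rb_newobj to its own handler.
  def hook(name, fn)
    @slots[name] = fn
  end

  private

  def resolve(name)            # plays the runtime dynamic linker
    @resolutions += 1
    @slots[name] = LIBRARY.fetch(name)
  end
end
```

Calling through the same slot twice only invokes the "linker" once; the `hook` method is the punchline, writing a different function into exactly the slot that cache lives in.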
So initially it's set up this way because it has no idea where the function lives. That push instruction is saving an ID, stashing some state, and the second jump invokes the runtime dynamic linker. The runtime dynamic linker runs, looks at the ID that was saved, and says: okay, word, I know where this thing lives. It fills in the address in the table and transfers execution to the function, rb_newobj. The table then acts like a cache: the next time you call rb_newobj@plt, it jumps directly to rb_newobj without re-invoking the runtime dynamic linker. Cool. Alright, are we good right now? Everybody have questions? Alright, word. Okay, so basically, we hook the global offset table: we redirect execution by overwriting that entry, so that instead of rb_newobj, it calls our handler function. This whole setup is the initial state, right, and then the linker runs and writes the resolved address in there so that the next time you hit rb_newobj correctly. We want it to call our function instead of rb_newobj, so we just pretend that we are the runtime dynamic linker. We say: yo, I am the linker. We go in, find the table, and write the address of our function there instead, completely bypassing the runtime dynamic linker. Using that same diagram, what that looks like is: you have the PLT entry, and the entry in the table now refers directly to our other function, completely sidestepping that push and that second jump to go call our handler instead. Our handler calls rb_newobj, does some tracking stuff, and returns the object. So you should be thinking: yo, your handler calls rb_newobj. Is that an infinite loop?
Won't it just end up calling itself over and over again? No. Keep in mind what I said before: each shared object has its own set of this state, and our handler happens to live in the memprof shared library. As long as we modify everybody except memprof, we're good, because memprof still resolves symbols the normal way; everybody else is screwed, because I just wrote over their tables. So we're now tracking object allocations, but we need to find leaks, so we need to track when objects get freed. add_freelist is called in the VM when an object is freed. Want to just overwrite the call instructions for that, or hook the GOT, or whatever we just did before, just do that again? Actually, we can't, because this add_freelist function is inlined. That right there basically suggests to GCC: yo, you can build this thing, output a set of instructions, then take those instructions, insert them into whoever called me, and optimize the calling function from there. So if the thing gets inlined, you won't actually see any calls to add_freelist. So what now? Are we jacked? Can we do anything? What's the deal? Well, if we carefully study this small snippet of code, we notice that freelist is getting updated, and freelist has file-level scope. So this thing lives at some static address somewhere, and the VM overwrites it every time it updates the free list. So why don't we do something really stupid: we know there's some static address, and we have updates to freelist, so why not search the binary for mov instructions that have freelist as the target, and just overwrite all of those? So let's do that. But we have a problem. The problem is that the system isn't ready for a call instruction yet. What does that mean? Well, the 64-bit ABI says that the stack has to be aligned to a 16-byte boundary after any and all arguments have been arranged. Okay, since we're just overwriting some random
mov, we have no idea what the state of the system is, and we can't guarantee that we won't completely fuck something up by just doing a call. So we can't just plop a call instruction in there; we have to do something different. We need to obey the ABI: move things out of the way, align the stack, save the registers we're supposed to save. So what can we do? We can use a jump instruction. Jumps have no restrictions; we can just jump anywhere. The difference between a jump and a call is that a call saves a return address, so you can do a call and then go back to where you were before; a jump leaves no trace of where you came from. So we transfer execution to an assembly stub. That stub sets up the system in accordance with the ABI, then transfers to our C-level handler, which does its stuff, and then we have to make sure to jump back to where we overwrote that instruction when we're done, because, as we said before, a jump saves no state. Okay, so let's do it. There's an instruction updating the freelist. We can't overwrite it with a call instruction, because the state of the system isn't ready for a function call, as I just said, so we insert a jump instead. Now, we can't change the size of the binary. The mov instruction up there is pretty wide, seven bytes; the jump is not quite as wide, only five bytes, and you can't change the size of things, so we insert two NOPs just to pad it out. That address right there is the address of the assembly stub, and when we're done with the assembly stub, we jump back here, fall through the NOPs, and continue executing code in the VM as if nothing happened. So don't worry, I'm not going to go through all of this, but this is the slightly abbreviated version of the assembly stub that's needed to make hooking the freelist work. I wonder if I can get my pointer over there so I can show you
something. Probably not. Anyway, that first instruction at the top right there: we overwrite the mov that updates the freelist. We're saying, okay, this thing updates the freelist, and we're going to completely stomp on it and insert a jump. But we still have to actually update the freelist, otherwise the state in the VM is completely inconsistent. So we need to figure out how the freelist is being updated and regenerate that instruction so the update actually happens, save all the stuff we need to save, and set the world up to align the stack. If you look two spots up from that green box, there's an and instruction: that's aligning the stack to a 16-byte boundary, complying with the ABI. Then that call instruction transfers to our handler. Our handler says: yo, this object was just freed, let's mark that in our internal state. We return from this handler back to that leave instruction, restore a whole bunch of state, and then continue executing the Ruby VM as if nothing happened. And surprisingly, this actually works. So here's a sample of how you might use the gem in a really simple way: you require memprof, you call Memprof.start, you require a bunch of other stuff, and it gives you output like that. The important thing to notice about this output is that it's not just telling you, yo, you're taking up a lot of memory; it's telling you the actual numbers of objects being allocated in this specific file, on this line. It's telling you exactly where you're leaking memory, and it even includes the type, so you can see what objects you're leaking. But the really, really important thing about this slide, the thing you should all be completely out of your minds about, is that it actually contains Ruby code. Oh my god. So that's cool, but what else can we do with memprof? Well, we can also track just a block: we can call Memprof.
track, pass it a file, do stuff, and track the memory allocations and memory leaks of that specific block. Or we can dump the entire heap as JSON, and if you dump the entire heap as JSON, you can load it into the beautiful web UI that Aman demoed yesterday at the lightning talk. There's also a middleware component, too. We were thinking: wouldn't it be cool if you could insert memprof as middleware and get per-request object-count information? That way you can see that, per request, you're leaking a hundred controllers or whatever. So you can do that: you install the memprof gem, require the memprof middleware, set up some middlewares, and each request gets output like this, telling you the number of objects and the file and line these things are being leaked from, so you can easily track down what's wrong with your applications and fix your memory bloat. As Aman demoed yesterday, there's memprof.com; you can go there, check it out, play around with it. That's the web UI built on top of the output from the gem: the gem outputs data, and the web UI parses it and makes it really easy to read. There are a couple of limitations to memprof. It only works on AMD64 Linux and Snow Leopard; there is some 32-bit support, but it doesn't really work (working on it soon, hopefully). It only works on MRI and REE 1.8 right now. It only works on binaries that are not stripped, and the OS X system Ruby is not supported, because it is really stripped.
Support for the EY rubies, which are also stripped, is forthcoming, as is support for stripped rubies on Debian, Ubuntu, and so on. There's a little more work there on the user's end: you'll have to install the debug package that comes with your package manager to get debugging symbols, so nothing too crazy, just an apt-get install or an emerge or whatever. I'm working on support for that; I was working on it this weekend, actually. And don't worry, there's more evil brewing, and lots of really crazy, stupid ideas that we hope you'll love, so stay tuned to find out what they are. Just as a hint: 1.9 support for memprof is right around the corner; that's one of the ideas we're playing around with, and we're pretty close. Anyway, we're getting close to the end of the talk, and I just need to say thanks to a bunch of people without whose contributions this would have been impossible. Use RVM: without Wayne's project, testing this on all the different rubies would have honestly been impossible for us. Use his project, give him money, he's awesome. (Hope you were clapping for Wayne, because I'm not done yet.) So, just to be completely clear, this talk was about the memprof Ruby gem. The gem is completely free, open source, et cetera, and provides text-based output. You can get it at GitHub, on my account right there. You don't want to install it from Gemcutter yet, because I haven't cut a new release and the current release there is pretty old. If you want to come hang out on IRC and talk about how this works, or if you want to help out, jump in there. And memprof.com is a separate thing being worked on on the side that generalizes this: it loads in the data from the full heap dump and makes it nice and easy to read, and it's sort of in alpha mode right now. Special thanks to a couple of people: thanks to Aman for making the beautiful web UI, the JSON output, and much, much more. Thanks to
Jake Douglas. Without Jake, none of the Apple people would be able to run this on your systems. I kept telling Jake while I was working on this, I don't care about Apple, I just want to make this work on Linux, and he was like, well, I care. So he wrote it. He's awesome. Thanks to Brian Lopez, brianmario, because he's cool, and to this guy over here with the camera, Brian Mitchell, because if he hadn't told me not to do it, then I wouldn't have done it, or something. So, questions? And yeah, that's it, thank you. What's up, man? You asked whether there's any chance this would work when the memory isn't writable, the NX bit? Oh yeah: this works with NX, because I just use mprotect and say I don't care. I have additional material about that level of engineering. What's up? You were wondering if there's something I'd recommend on low-level programming for high-level programmers. People ask this all the time, and I really don't know of an existing resource for that type of information. You've just got to read lots of code, disassemble stuff; you can read the Intel instruction set architecture manuals. There's some amount of knowledge you just have to go out there and get by trying to do it. On my blog I write about this type of stuff and try to lay it out and explain it in a way that I hope people who aren't really familiar with assembly and really low-level stuff can understand. So if that's your thing, check out Time to Bleed, and if I ever find any good reading materials, I'll definitely post and link to them. What's up? It works on MRI and REE; how does this compare to JRuby or Python? I haven't looked into JRuby, but all this evil stuff I just did, I can do the same thing in Python, because I'm not scared. I mean, it has to be rewritten so it'll work with the Python VM, obviously, but the same tricks will work, because Python can load C extensions, and as long as you can load a C extension, you can
do whatever you want. Actually, I now have an answer to the earlier question about materials: there's a book called Programming from the Ground Up. It's a free book, you can read it online, it covers all this Linux assembly stuff, and it's very easy to read. And the memprof source is actually really well documented; Jake and I have been working hard on it the last couple of weekends, so you can definitely check out the memprof source for information as well. What's up, man? Can you dynamically insert, say, your own method implementations? Yeah: using this base framework, the way memprof actually works, you can do literally whatever you want. You can rewrite Ruby to do it. If you just say, hey, I don't like the threading implementation, you can just remove the one that's in there, insert your own, and do whatever you want. You can do anything, dude. Alright, cool. Thanks a lot, guys.