dtruss, if you use it on the command line, is sort of like a wrapper around DTrace. It doesn't give as fine-grained output, but it's useful.

Alright, so: strace Ruby. What happens when you strace Ruby? You get this. So, you know, what's with all the SIGVTALRMs? They just come up like crazy, all over the place. What does that mean?

It turns out that Ruby uses setitimer and signals to schedule green threads. (This only happens in the case where you're not building Ruby with --enable-pthread, which we'll get into in a little bit.) The first time a new thread is created in Ruby, Ruby calls setitimer to create a timer, and it tells the kernel: hey, every 10 milliseconds, send me a SIGVTALRM, I want to know what's going on. And when Ruby gets that signal, it fires a handler called catch_timer. The second call down there, posix_signal, is attaching that handler to that signal.

So let's take a look at the code in the Ruby VM. This is sort of an abbreviated version of what's going on. You can see here you have your rb_thread_start_0 function over there on the right; that's called every time a new thread is started. The first time you start a thread, it'll flip the flag thread_init, saying, hey, we started the timer, you're good to go. posix_signal right there is attaching the catch_timer handler function to the signal, and rb_thread_start_timer is calling setitimer down there, saying, hey, our handler is set up, we want to get signals every 10 milliseconds so we can time stuff.

So if you strace Ruby, you can actually see this happen: you attach strace, you see a call to setitimer, you see the SIGVTALRMs come in, everything's cool. But the big problem here is that once you start one thread, even after all your threads die, the timer still fires every 10 milliseconds, interrupting all of your code, which is bad. And if you strace your Ruby code, you may say, hey, I'm not using threads, so what's the deal? Well, Net::HTTP uses Timeout, and Net::SMTP also uses Timeout, and Timeout is built on threads. So once you spawn a single thread, this timer that's hitting your Ruby process, like I said before, will continue interrupting your Ruby process forever, which is bad.

So we wrote a patch to the Ruby VM: stop the thread timer. It's pretty simple. The check basically says, hey, if I'm the last thread, turn off the thread timer, stop interrupting my Ruby process, I want to be able to run code. You attach strace and you can see it: the timer starts, some threads were spawned, alarms came in, and then the timer was turned off. So this is actually a pretty big win. Our code started running faster; we didn't have to worry about stuff.
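(For reference, the setitimer-plus-handler pattern described above boils down to something like the following standalone C program. It's a minimal sketch of the same mechanism, not Ruby's actual code; run it under strace and you'll see the setitimer call followed by a stream of SIGVTALRM deliveries, just like the Ruby trace.)

    #include <signal.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    static void catch_timer(int sig) {
        /* MRI's real handler just sets a flag the VM checks; we do an
           async-signal-safe write so the ticks are visible */
        (void)sig;
        write(STDOUT_FILENO, "tick\n", 5);
    }

    int main(void) {
        /* attach the handler, like MRI's posix_signal(SIGVTALRM, catch_timer) */
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = catch_timer;
        sigaction(SIGVTALRM, &sa, NULL);

        /* ask for a SIGVTALRM every 10ms of consumed CPU time, like
           rb_thread_start_timer does; ITIMER_VIRTUAL delivers SIGVTALRM */
        struct itimerval tv;
        memset(&tv, 0, sizeof tv);
        tv.it_interval.tv_usec = 10000;   /* re-arm every 10ms */
        tv.it_value.tv_usec    = 10000;   /* first tick after 10ms */
        setitimer(ITIMER_VIRTUAL, &tv, NULL);

        /* the stop-timer patch amounts to calling setitimer again with a
           zeroed struct itimerval once the last thread dies */
        for (;;) ;   /* spin so CPU time accrues and the virtual timer fires */
    }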
So the next big performance improvement we did on the threading implementation: we had Debian servers in production, and we straced our Ruby process, because we were like, oh wow, this Ruby process is really, really slow, what's the deal? So we attached strace, and we saw all these calls to sigprocmask. We're like, okay, let's get a count of how many calls there are. Looks like a lot. We run strace, and there's three and a half million calls to sigprocmask in about 100 seconds, which is a large number of system calls. So what's the deal with that? Why is that happening? Well, it turns out that what the Ruby VM actually does when you enable pthread is sort of a little bit confusing.

A lot of people think that when you pass --enable-pthread to configure when you build Ruby 1.8, you're telling Ruby to use native threads. That's not the case. When you pass --enable-pthread, it's saying, hey, use a native thread on the system to do the timing for the green thread implementation. Enabling pthread is also useful for compatibility with Ruby/Tk and other stuff that uses native threading. But if you just look at the diff of what happens when you enable pthread, you get these other defines that pop up, and beyond creating your timer thread, those other defines turn other things on too, which we'll talk about right now.

It turns out that the bottom two defines there are enabling getcontext and setcontext. So what are those functions? What do they mean? What do they do? What's the deal? getcontext and setcontext are part of a system called ucontext, and it turns out that Ruby can use either setjmp and longjmp or getcontext and setcontext in the threading implementation and for exception handling. What they do: both the setjmp/longjmp family and getcontext/setcontext save and restore the current CPU state, so you can save state, execute some code, and if something bad happens, restore and go back to wherever you were before. So setjmp and longjmp do similar things to getcontext and setcontext, except that ucontext is sort of a more advanced version that allows you to modify the saved state. The downside, though, is that those two functions also save and restore the signal mask, and hence call sigprocmask, which is why we were hitting three and a half million of those calls every time we ran Ruby for a short amount of time.

So, a simple patch to fix this guy: just patch the configure script and add a new flag called --disable-ucontext. What this says is, hey, I want the timer thread, but I don't want you to call sigprocmask. With this patch you can strace again, all the sigprocmask calls are gone, and Ruby's now 30% faster, which is pretty sick.
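(You can see the difference yourself with a toy like this — a sketch assuming Linux/glibc, where plain setjmp does not save the signal mask but getcontext does:)

    #include <setjmp.h>
    #include <ucontext.h>

    int main(void) {
        ucontext_t ctx;
        jmp_buf env;
        int i;

        for (i = 0; i < 1000000; i++)
            getcontext(&ctx);  /* saves the signal mask: one rt_sigprocmask
                                  syscall per call */

        for (i = 0; i < 1000000; i++)
            setjmp(env);       /* saves registers only: no syscall on
                                  Linux/glibc */

        return 0;
    }

Run it under strace -c: the getcontext loop accounts for about a million rt_sigprocmask calls, the setjmp loop for none. That syscall-per-save is exactly the overhead --disable-ucontext removes.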
Cool. So, I maintain the EventMachine gem, and there was this long-standing problem that when people used it with threads in the Ruby VM, everything would basically be unusably slow. So I decided I needed to take a look at this, and it was especially a problem because we were using Thin, the HTTP server, which would start those timer signals coming in, and everything would sort of grind to a halt. So I knew I had to profile it.

I started out by building a simple repro case. EventMachine basically handles network I/O, and it allocates big buffers on the stack to copy incoming data. So I wrote a simple extension: you call this C function, which allocates a large buffer on the stack, and after allocating that buffer it goes back into Ruby and basically executes a bunch of Ruby code and does a lot of threading.

I started running this through a profiler, and we decided to use google-perftools, which is basically the profiler that Google uses internally. It's really cool, it works really well. You can download it and compile it, and it builds a shared library that you can either link into your application or preload. On Linux you can set LD_PRELOAD, and on OS X there's an equivalent. Once you set that environment variable, any binaries you launch will first load this library before doing anything else.

Once you've loaded the library, all you have to do is set an environment variable called CPUPROFILE and point it at a file name. Once the binary finishes running, it's going to dump a whole bunch of statistics to that file. Once you've created that file, you just run the script they bundle, called pprof, on it, and it gives you a bunch of output. The cool thing about this profiler is that not only does it have really useful text output, it can create these really nice graphs, and you can just look at the graph and tell right away what's taking the most time.

So I ran this on the EventMachine threading problem, and I got back something like this. There were some candidates that made sense, definitely — rb_thread_save_context, rb_thread_restore_context — but it turned out they were all calling memcpy, and this was really surprising. I didn't really believe it at first. I was like, this can't be true: is memcpy really taking that long? So I decided to try yet another tool to confirm this was actually happening.

That other tool is called ltrace. It's very similar to strace, but it traces library calls instead of system calls, and the syntax is almost exactly the same. You can run this again with -c, which gives you a summary, and sure enough, memcpy is the first one on there, taking a large amount of time. You can also run it in detailed mode and get a bit more information about what's going on: in this case, right after a SIGVTALRM there were two calls to memcpy happening, and all of these calls were adding up.

So we know it's definitely calling memcpy, but the question was: what is it copying, and why is it copying so much stuff? What exactly is going on? We know it's getting called from rb_thread_save_context and rb_thread_restore_context, so we can pull up the C code for those and walk through it. The first thing save_context does is call setjmp, which we talked about — we know that saves the CPU state. Then there's the call to memcpy, that's where the memcpy is, and it looks like it's taking the current stack position and copying something from there. It turns out it's actually copying the entire stack associated with that thread: all the stack frames in that thread get copied away. And the third thing it does is save a whole bunch of VM globals — which basically tell the VM where it is — away into the thread structure. restore_context basically does the same thing in the opposite order: it restores all those globals, then it memcpys the stack back, and then it longjmps to the saved CPU state. So now we sort of have an idea of what's going on: it's copying entire thread stacks to the heap.
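(Boiled down, the save half of that dance looks something like the following. This is a hedged, simplified sketch of the scheme just described, not MRI's actual code: stack_start stands in for Ruby's rb_gc_stack_start, and the 968-byte pads just mimic rb_eval-sized frames. Compile with -O0 so the pads aren't optimized away.)

    #include <setjmp.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static char *stack_start;   /* like MRI's rb_gc_stack_start, set at startup */

    static void save_stack(void) {
        jmp_buf ctx;
        char here;
        if (setjmp(ctx) == 0) {                       /* 1. save CPU registers */
            size_t len = (size_t)(stack_start - &here);  /* x86 stacks grow down */
            char *copy = malloc(len);
            memcpy(copy, &here, len);                 /* 2. the memcpy ltrace caught */
            printf("copied %zu bytes of stack to the heap\n", len);
            free(copy);
        }
        /* 3. MRI also snapshots VM globals here; restore_context runs the
           whole thing in reverse and finishes with longjmp */
    }

    static void deepen(int n) {      /* grow the stack before snapshotting */
        char pad[968];               /* roughly one rb_eval-sized frame */
        memset(pad, 1, sizeof pad);
        if (n > 0) deepen(n - 1);
        else save_stack();
    }

    int main(void) {
        char top;
        stack_start = &top;
        deepen(10);
        return 0;
    }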
But what does that actually mean? I'm just going to explain for everybody — cool, any questions on where we're at so far? — stacks versus heaps.

Alright, so, stacks: what's the deal? The stack is storage for local variables, and those variables are only valid while their stack frame is on the stack. And as you call functions, the function calls push metadata onto the stack to keep track of where you were called from. We're going to go through a diagram in a second. You also have a heap: storage for variables that persist across function calls, typically managed by your malloc implementation, so libc or tcmalloc or whatever you use.

So you have this func1 right here that you're calling (it's reconstructed in the code sketch below). func1 allocates a void pointer called data; let's just assume we're on 32-bit, so those 4 bytes live on the stack. It calls func2, so that pushes some metadata onto the stack to say, hey, we're calling this function, here's how to get back to func1. func2 is allocating a char * str: the storage for the pointer itself, those 4 bytes, is on the stack, but the call to malloc right there is going to put 10 bytes on the heap, plus metadata needed by the malloc implementation. Then it calls func3. func3 allocates a buffer on the stack, a thing called buffer; 8 bytes go there on the stack. Once func3 returns, those 8 bytes are gone and the buffer is no longer valid. So those are the basics of stacks versus heaps.

So, we're memcpying the thread stacks. What does that mean at a high level? You have your Ruby process executing (it's kind of cut off), and over here on the left you have your current program stack, and then these are the other thread stacks in your Ruby process, saved on the heap, waiting for their turn to run. When a timer signal comes in, or the scheduler decides it's your turn to run, it copies the entire current program stack onto the heap to save its state, and then it copies the next guy to run from the heap onto the current program stack, right over itself.

So this is interesting, and we'll figure out what's going on, but first maybe it's interesting to find out what's actually on these thread stacks. We used gdb to figure out what the deal is. I'm going to go through a really quick gdb overview for anybody who hasn't used it before. If you're going to experiment, make sure to build your app with -g and -O0; otherwise you'll have to read a lot of assembly.

So, a gdb walkthrough, not too intense. You run gdb, you pass in your program, you start it up, there it is. You can put a breakpoint on a function; in this case this program just calculates the average of two numbers (a stand-in for it appears below), so we set a breakpoint on average and say, hey, keep running. gdb hits the breakpoint, freezes the program, and says, yo, I just called average, here are the two arguments, this is the line of code I'm on, and, you know, what do you want to do? So I said, okay, give me a backtrace. That shows me: I'm in the function average, and average was called from main. This is that idea of a stack we just talked about a couple slides ago — yes, that's the function stack. You can also ask gdb to step through C code line by line: you type s and press enter, gdb executes one line of C code and then lets you do something else. You can output local variables by just running p and then the variable name. gdb has lots of stuff; this is just a quick overview of some of the useful features. You guys should definitely check it out.

Alright, so what's on the Ruby stack? We attach gdb to Ruby, we hit backtrace, we want to see what's going on, and this is just a small snippet of the entire stack trace — it's pretty massive. As you can see, all C programs, including Ruby, have a main function. main is all the way down here, and it immediately calls ruby_run, which starts the Ruby VM. Then there are the fragments from our Ruby code. Like, I'm sure you guys have used things in Numeric: if you say 5000.times and pass a block, that actually calls the C function int_dotimes, and int_dotimes ends up calling, right above that, rb_yield, to yield to the block that was passed in. So that's sort of how the inner workings of the VM go down. And if you notice, there are lots of calls in the stack trace to rb_eval. It turns out that rb_eval evaluates the code in your Ruby program, and rb_eval calls itself recursively throughout the execution of your program.
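(Going back to the stacks-versus-heaps walkthrough: here's that three-function example reconstructed as C, with the placements from above in the comments. The slide's exact code wasn't shown, so treat the names and sizes as illustrative.)

    #include <stdlib.h>
    #include <string.h>

    void func3(void) {
        char buffer[8];          /* 8 bytes on the stack; gone once func3 returns */
        memset(buffer, 0, sizeof buffer);
    }

    void func2(void) {
        char *str = malloc(10);  /* pointer: 4 bytes on the stack (32-bit);
                                    10 bytes + allocator metadata on the heap */
        func3();                 /* the call pushes return metadata on the stack */
        free(str);
    }

    void func1(void) {
        void *data = NULL;       /* 4 bytes on the stack (32-bit) */
        (void)data;
        func2();
    }

    int main(void) {
        func1();
        return 0;
    }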
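(And for the gdb walkthrough, the demo program just averaged two numbers. A stand-in for it, plus the session commands used above, looks like this — again a reconstruction, since the original source wasn't shown:)

    /* average.c -- build with: gcc -g -O0 -o average average.c */
    #include <stdio.h>

    int average(int a, int b) {
        return (a + b) / 2;
    }

    int main(void) {
        printf("%d\n", average(6, 8));
        return 0;
    }

    /* a typical session:
         $ gdb ./average
         (gdb) break average      # put a breakpoint on the function
         (gdb) run                # gdb freezes at average(a=6, b=8)
         (gdb) backtrace          # average was called from main
         (gdb) s                  # step one line of C
         (gdb) p a                # print a local variable: $1 = 6
    */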
So, back to rb_eval calling itself recursively — that's sort of important, because we want to see how big these stack frames are. What's the story? We're seeing memcpy, we're seeing lots of data being copied; let's just get an idea of how big we're talking about.

This next bit is a little bit of magic, but basically what's going on is: I'm in gdb, and I'm saying, yo gdb, right where I'm at now, I want to get the base pointer for the current stack frame and subtract from it the bottom of the stack — ebp and esp, these are just two CPU registers. So I print that out, and look what comes out: 968 bytes. Each rb_eval stack frame is almost one kilobyte, which is a large amount of space to be copying back and forth. And if you want to get the entire extent of the stack, the entire Ruby program stack right now: Ruby has an internal variable, rb_gc_stack_start, and you can subtract the current bottom of the stack from that and say, oh look, the Ruby stack is, you know, 10k that will get memcpyed back and forth. So if you have 50 method calls, each with 1k stack frames, you end up with a 50k stack. It turns out that in Rails you can have several hundred method calls for a single request, so we're talking about a shitload of data getting copied back and forth every time a thread switches.

[Audience question] Yeah — should a single rb_eval stack frame really have that much space? I would say anywhere on the order of 256 bytes is pushing it on the high end. But the thing is, it shouldn't actually matter what you have on the stack; it's just that the threading implementation is broken, so it does. We're going to get into how you fix that, and how it stops being a problem, very shortly.

Alright, so a quick recap of where we've gotten so far: how do threads in Ruby work? Each thread has its own execution context: you save and restore the CPU registers with setjmp and longjmp, you have a copy of the VM globals, and a copy of the stack that's made by calling memcpy. Ruby switches between threads by executing until it gets a signal from the kernel; then it saves the current context, calls the scheduler to pick the next guy, and restores that guy's context, and he starts going. And between those two phases there are two calls to memcpy: one to save, one to restore.

But if you were paying attention at the beginning of the talk, you're saying, hey, yo, you said at the beginning that the whole point of green threads is that they're supposed to be fast and cheap at the sacrifice of SMP — you don't get multi-core, but they're supposed to be fast. And that much copying, which is what we're seeing in all these traces and the code we just looked at, is neither fast nor cheap.

So how do we fix this problem? Well, we can fix it by just not copying stuff. A stack is just a region of memory, so why don't we just point the CPU at a region of memory that lives on the heap, and when we want to switch to a new thread, we just swap a new register context onto the CPU and do no copying at all? It turns out that's exactly what we can do, and we wrote a patch to do it. This is a really, really brief overview of the few important lines that went into the patch — there's lots of other stuff that has to go on behind the scenes too — but as a quick walkthrough of the zero-copy threading patch and how it works: when you call rb_thread_start_0, we allocate a thread stack by calling mmap.
Then, when it's time to switch into that thread to execute, there's just a little tiny trampoline of inline assembly. This inline assembly just swaps the stack pointer out manually to switch to the other guy.

How does that actually work? It's not that crazy. You have your currently executing thread over there on the left — that's your current program — and you have these other guys that live out on the heap. When a signal comes in and it's time to switch, you don't copy: you just run that little piece of inline assembly to swap the stack pointer over to the guy who lives on the heap, and you just keep executing. Next time a signal comes in, you do the same thing and swap to the next guy on the heap. You've now eliminated all the copies that were going on in the threading implementation.

What does that mean in terms of a benchmark? Luckily, there are people who like to benchmark things, so we stole a benchmark from the Computer Language Benchmarks Game: the thread-ring benchmark. And to illustrate the speed boost from this zero-copy threading patch, what we decided to do was grow the stacks a little bit before each context switch, just to show how intense this change actually is. So we wrote a little function called grow_stack: it calls itself recursively until it's called itself 20 times, and then it yields to the block that was passed in. The benchmark looks kind of like this: it's creating 500 threads, and we increase the thread stack size in each thread. The threads all pause when they're first entered, and when they resume, they decrement one from the number. As you can see, the number all the way at the top is set to some large value, like 50 million. So what this is actually doing is getting a bunch of threads together; each thread subtracts one from the total and then dequeues itself, letting the next guy run. So what we're really benchmarking is the cost of context switching between lots of threads as they all work together to decrement this shared value. The results: we passed in 50 million. On standard Ruby 1.8.6, no patches, it takes a number of seconds that works out to about two hours and change. On Ruby 1.8.6 with our thread fix, it takes about 13 minutes. So we were pretty happy.

[Audience question] Actually, we're going to talk about GC later — there's an entire separate talk that answers your question, Josh.

[Audience question about growing stacks] So, the way it's implemented right now in the Linux kernel for processes, there are two models. In one model, growing your stack causes a page fault; the kernel sees that and maps more memory in to grow your stack. But that's the older model of processes in the Linux kernel. The current model is to basically set a limit, RLIMIT_STACK, and that's commonly set at 8 megs — on a 32-bit system it's set at 8 megs. So the operating system will just say in advance, hey, you're only going to get 8 megs to execute in, and once you fall off the end of that, you're screwed and the program gets killed. That's just a problem in general, right? You don't have infinite space. Do you want to jump in? Yeah, we decided to just go with the easy solution and added Thread.stack_size, so at the beginning of your program you can set how big you want your stacks to be. We brainstormed a couple of solutions for growing the stacks automatically, and we decided it wasn't worth it in the end.
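(The patch's trampoline is a few lines of platform-specific assembly, so instead here's a sketch of the same idea using the portable ucontext API: hand the CPU an mmap'd heap region as its stack, and switch by swapping register state, with zero copying. Error handling is omitted. Ironically, swapcontext itself calls sigprocmask — the very overhead --disable-ucontext removed — which is part of why the real patch uses raw assembly.)

    #include <stdio.h>
    #include <sys/mman.h>
    #include <ucontext.h>

    #define STACK_SIZE (64 * 1024)

    static ucontext_t main_ctx, thread_ctx;

    static void thread_body(void) {
        printf("running on an mmap'd heap stack\n");
        swapcontext(&thread_ctx, &main_ctx);   /* switch back: registers only,
                                                  no memcpy of stack data */
    }

    int main(void) {
        /* like the patched rb_thread_start_0: give the thread its own stack */
        void *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        getcontext(&thread_ctx);
        thread_ctx.uc_stack.ss_sp   = stack;       /* point the CPU at heap memory */
        thread_ctx.uc_stack.ss_size = STACK_SIZE;
        thread_ctx.uc_link          = &main_ctx;   /* where to go if the body returns */
        makecontext(&thread_ctx, thread_body, 0);

        swapcontext(&main_ctx, &thread_ctx);       /* "context switch" in: no copying */
        printf("back on the original stack\n");
        return 0;
    }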
Cool. So we fixed this thing: it's fast, it's cool, everybody's happy. What's next? Well, MRI's thread scheduler sucks. Thread switching might be fast now, but the scheduler is still pretty bad. This is a little bit of a cleaned-up version of what it looks like: it's basically just iterating through the entire thread list over and over and over again, you know, five-plus times, maybe more. Complexity theory will say that constants don't matter, they just drop out; but if you have tens of thousands of green threads, going over these lists several times can definitely add up.

So what's next? What do we do now? How do we fix this? Well, we could rewrite the scheduler, but that's too much work. Or we can just get rid of the scheduler, and now we've come full circle: we're back at fibers. So we're backporting the fibers API to MRI, using the fast threading patch we just described. Behind the scenes, the way it works is: you just create a thread, you don't add it to the scheduler's list, and then you schedule the thread manually with yield and resume.

So hopefully at this point you're asking yourself: where can I get all this awesome stuff? You can get it on GitHub. I have two branches — one has the heap stacks we just talked about, the zero-copy thing — and Aman has a branch that has some fiber stuff in it. If you don't like applying patches or building stuff from GitHub, you can use Ruby Enterprise Edition. REE is based on 1.8.7, it's open source, it has our thread timer fix, our zero-copy threading patch can be enabled with a flag, and it also includes some patches from the MBARI patch set, which help reduce the stack frame sizes of important Ruby functions. So REE is actually really fast, and we're going to talk about it again later in the GC talk, but you can get it at rubyenterpriseedition.com. That's it — questions? And you can grab us on Twitter.

[Audience question about 1.9] These patches wouldn't apply to 1.9, and as far as getting this stuff backported into mainline 1.8, the answer right now is no. The reason this will never be in 1.8 mainline is several reasons, one of which is that our code is platform-specific: we only support 32-bit and 64-bit x86, so x86 and x86-64. Ruby supports lots of other platforms, like human68k and all kinds of weird CPUs, and, I don't know, Ruby is a lot about portability and stuff, and we're just not going to write the assembly for all those architectures. We have submitted some of these patches, though: the initial timer patch we talked about, the one that stops the timer once the last thread dies, was recently accepted and should show up in a 1.8.7 point release. The other patches, like Joe said, are very platform-specific, and that's why we were able to get them into REE — REE is specifically meant for people deploying real applications on their servers.

[Audience question comparing 1.9's threads] It's kind of a huge philosophical argument, because it's like: what type of threading are you doing? Are you I/O-bound or are you CPU-bound, and what do you care about? There are lots of tradeoffs in both directions. So I don't think there's one answer, like, yes, the threading implementation is better here or worse there, or whatever. They both have their pros and cons; it's just tradeoffs. I mean, if you ran this thread-ring benchmark on 1.9, it would be comparable to, or maybe slightly slower than, our patch, and that's just because, since 1.9 is using kernel threads, the kernel is able to do basically all the magic we're doing — in the kernel — and you don't have to do it each time.

So, the slides: we're going to have to clean them up a bit
because of all the transitions. Hopefully we'll do that tonight or tomorrow, output them to PDF, and then they'll be on timetobleed.com, my blog, or on Twitter. All the stuff will definitely be available once we have a couple of minutes to clean it up. [Audience question about finding the slides] Yeah, I'll tweet something and I'll hashtag it, and you guys will be able to find it.

[Audience question about why MRI copies stacks at all] The logic is that you don't have to write platform-specific code, because on different platforms the stacks grow in different directions. So if you just say, hey, we're not going to write code per CPU, we'll just copy the stacks, it makes the implementation a lot easier, and you can support things like continuations. Our zero-copy threading patch actually breaks continuations. You can fix them, but I don't care about continuations.

[Audience question about what continuations are] Do you want to take this? It's like: you execute, you save state, you can store that state away, and you can resume it later. So it's sort of like a thread or a fiber, except you can save once and jump back to that state multiple times. With a fiber, if you save state and resume, it's going to keep going; but with continuations, you can jump back again and again.

Thanks for listening, guys, and I'll catch you later for GC.