And I'm really glad to have someone here to talk about JRuby. You know, JRuby is a big deal, you should all be paying attention to it. And so I'm really happy to have Charlie Nutter here. He's at Red Hat and he's one of the principal brains behind JRuby. He works on JVM languages at Red Hat, doing a lot of cool stuff with them. And he's going to tell you about making Ruby high performance. Thank you, Charlie. Alright, thank you. Alright, I have to ask the usual intro questions. How many people have ever used JRuby for something? Wow. How many people are using it for something in production right now? Okay, fewer, but we're getting there. A lot more people have tried it out, that's good. So I'm going to talk specifically about JRuby and how we're trying to optimize Ruby, but in general, what the challenge is and some of the ways we've come up with to make Ruby fast, to actually make Ruby a higher-performance language than it is. So, basic contact info. As mentioned, I am one of the JVM language guys at Red Hat, specifically in the JBoss polyglot group that's working on polyglot web applications, web services, stuff like that within JBoss. Mostly doing JRuby right now, but hopefully after JRuby 1.7 is done in the next month or so, I'll be looking at taking what we've learned to some of the other JVM languages as well. So what does performance mean? There are usually two metrics people use when they talk about Ruby, as far as performance goes. One is the whole argument that man hours are more expensive than CPU hours: alright, Ruby's not that fast, but we can write stuff in it really quickly, we can maintain it easily, everybody's really happy when they're using it. That's one measure of the performance of a language.
And by that measure, Ruby probably already is a high-performance language, if what you're trying to do is just get applications written and maintain them well. But I'm going to be talking more about the other definition of performance, the one that usually translates to the bottom line for applications that continue to grow: the straight-line performance of running code, and how we can make that better within Ruby. So what is high performance, then? Well, high performance is faster than something. Faster than other Ruby implementations: that's been one of my concerns, trying to make sure we're doing as well as possible there. Faster than other language runtimes: people will say Ruby isn't as fast as Java, or Ruby isn't as fast as .NET, or whatever else, and so it can't be considered high performance if it's not as fast as some of those other systems. Faster than unmanaged languages like C, raw C performance. We want Ruby to be as fast as or faster than C someday, right? I mean, that's a goal, but is it a reasonable goal? Is it a goal we actually need to reach? Really, high performance is just faster than you need it to be for whatever job you're trying to get done. Now, if you're just running a website, running a Rails application, high performance may not be a huge goal to achieve, and it may not be that difficult to serve up that many requests, especially if you've got other backend services doing a lot of the heavy lifting for you. But we want to be able to use Ruby for more of the system. We want to be able to use it everywhere and not have to fall back on other options all the time. So what does fast enough mean? That's one of the things people have said about Ruby for years. Apparently, Ruby 1.8.7 was fast enough. But now that Ruby 1.9.3 has come out, with considerably better performance than 1.8.7, Ruby 1.9.3 is now fast enough, and Ruby 1.8.7 is slow. So were they lying then?
Or did they just not need the performance they got from 1.9.3? Maybe they found other ways to work around it. So 1.9.3 is now fast enough. But again, for any application that continues to grow, there's going to come a point where fast enough just isn't fast enough, and then you have to have some other fallback. You hit this performance wall with your application, where you can't make it do what it needs to do in the amount of time you have available, trying to get a certain amount of work done with the available CPU resources, monetary resources, whatever you have. So at this point, what do you do? You can move to a different runtime. If you're on 1.8.7, you can move up to 1.9.3 and probably get a free performance boost for most stuff, or maybe a performance hit if you're doing something with encodings sometimes. Or you move to a different language. Maybe you fall back on writing C extensions, which is one problem that seems to plague Ruby a lot: people give up on Ruby itself and move to something else before they've really given Ruby its chance to do what it needs to do. And my claim here is that if you're not writing performance-sensitive code in Ruby, you're probably giving up too easily. There are other options. There are other ways we can make Ruby fast, and JRuby is one of them. So we're going to see how we're actually working on that. Probably the biggest dodge, like I said, is people falling back on native extensions to get performance. Now, native extensions for integrating existing libraries that do something you need are not a bad thing, not a universally bad idea. There are a lot of libraries out there that just don't exist in any Ruby form, or on the JVM in bytecode form. So calling out to native libraries is not bad. What's bad is the way they're implemented in CRuby right now. The C API for C extensions is very invasive. It has direct access to pointers.
People access the internals of objects directly all the time. And so unless you're MRI, you are very limited in being able to support this API. Even worse, MRI is limited in what it can provide as a runtime. So what are the things we want out of Ruby, as far as performance and making a scalable, high-performance platform? We want it to run code faster. That's one thing, which involves maybe getting a JIT in there, getting some native code execution going on. Better GC: even with 1.9.3, people see GC pauses of multiple seconds in production applications. Overall GC consumption of CPU is in the 10 to 20% range, and that's all just wasted cycles. So we need a better GC. We want to be able to run things in parallel. Rather than having to spin up multiple processes, we'd like to just have a threaded worker system or an actor system within a given process that can use all those cores, rather than coordinating across processes, serializing data back and forth, and burning all the cycles we spend on that. And then big data. This kind of falls under GC, but as the size of a Ruby application grows on MRI, that GC performance hit gets bigger and bigger. It continually increases the percentage of time spent scanning all of that data, even if only a small part of it actually gets collected. So these are the things we want out of a Ruby runtime. Unfortunately, these are the exact things we can't have with the way C extensions are implemented today. They need pointer access, so you can't move objects around in memory. You can't move old data into a GC-free zone, a more permanent zone, where it's no longer considered for garbage collection. JIT-wise, it's much more complicated to build a JIT that works with the way C extensions are written right now.
Parallel execution is almost impossible, because we can't make any guarantees about that native code, and none of those guarantees or promises have been made from the beginning. There is no standard memory model for how native extensions should parallelize, no guideline for how to write that code, and so nobody's doing it. Nobody's writing C extensions with parallelism in mind. And then big data, of course, has all the same problems as GC and execution: trying to deal with all that data in memory, when you're writing it all in native code and dealing with native access, means you're passing a lot of information back and forth. So there's a different approach. Rather than falling back to C and using C extensions, maybe we can try to make Ruby itself a heck of a lot faster. We can improve Ruby and improve the runtime. There are two options, really. You can build your own runtime, which is how Ruby 1.9 went with YARV, a new bytecode-based VM, and that has worked very well for them. Rubinius is another implementation where they've built their own runtime from the ground up: they have their own bytecode VM and they have a JIT internally as well. MacRuby I list under both approaches, because they've written their own compiler on top of LLVM, but the whole object system is basically the Objective-C runtime. Then, using an existing runtime, we've got JRuby, where we run on the JVM; MagLev, which uses the GemStone Smalltalk runtime to implement Ruby; IronRuby on top of .NET, of course; and there are lots and lots of other Ruby implementations that have chosen one of these two approaches. So the question you ask when you get to this point is: do you want to build something completely new, or do you want to take something off the shelf, an existing runtime, and just build on top of it? The truth about building a VM is that it's pretty easy to make a simple VM. There are lots of examples of this.
The early versions of Rubinius were very trivial, simple bytecode VMs. _why, before he left the community, did his little Potion VM, which was another little VM implementation. Marc-André Cournoyer did tinyrb, which was very small, under 64K of code or something, or under 16K, some really small number, and it was basically a VM for Ruby. But making it competitive with existing VMs, making it into something that can be compared against C or against native languages as far as performance goes, is incredibly hard. Incredibly hard to do. And so with JRuby, we obviously took the approach of taking something off the shelf that works already, that we know is high performance, and using that. So look at the JVM: there's something like 15 years of engineering in it, specifically in OpenJDK's HotSpot. It's free open source, it's GPL, and you can fork it and do whatever you want with it. And it is pretty much the fastest managed runtime, the fastest managed VM, available. Definitely faster than C# on .NET. When you look at measurements and comparisons people do with the JVM, with Java applications or at least Java algorithms, they're comparing against C and C++. It's up to that level. Sometimes it wins, sometimes it doesn't. But it is by far the fastest managed runtime available. So we just picked the best runtime we could, and we decided to build JRuby on top of that. It also has the best GCs available. If you look back over the past 10 years of research on garbage collection, on how to make it fast and how to make it use fewer resources, pretty much all of it ends up with a JVM implementation, at least as the proof of concept. For example, OpenJDK's HotSpot VM has, I think, six different options for garbage collection: collectors that use parallelism, that run concurrently, that reduce pause times or guarantee certain pause lengths, and so on. In general, if you're using a JVM, you're going to have the best garbage collection available for your system.
So we've already got that for free. All the major JVMs are fully parallel, fully threaded, so they've worked out all the difficulties of running threads in parallel for us. We just build on top of it. And then broad platform support: all the major operating systems, all the server operating systems, have JVMs. Even the most obscure ones, HP-UX, and we've got guys that run JRuby on AS/400 for whatever reason. A lot of you don't even know what an AS/400 is, so... Yeah, well, I mean, it's a better choice than a lot of the other options on AS/400, I'll tell you that. Now, this rumor is slowly starting to die out, finally. Java was slow at one point, before it got a JIT, before they did a lot of this work on garbage collection. But Java is really fast now. I mean, literally C fast, if you write code that matches up roughly one-to-one as far as the work being done. The reason people have this belief, the myth of Java being slow, is that the way a lot of Java libraries force you to write applications makes those applications slow. Java is a terrible application development language. It requires you to do so much abstraction just to save yourself in the end that you build these gigantic abstraction frameworks. You have too many levels, too many indirections and dereferences of objects all over the place, and so applications end up being terribly slow. But algorithms, simple sorting algorithms, mathematical algorithms, literally can compile down into exactly the same assembly code you'd get out of C or C++. So we can have C performance running on top of the JVM if we know how to do it. And the bottom line is that the way you write code is way more important than whatever language you use. The runtime is going to play into it sometimes; if the runtime has its own built-in performance hits, that's always an issue.
But the way you write code, on the JVM or anywhere else, is much more important than what you write it in. Okay, so that brings us to JRuby. JRuby is an implementation of Ruby on top of the JVM, written in Java and, bits and pieces, more and more in Ruby itself. So we're bringing Ruby to the JVM, we're also bringing the JVM to Ruby, and ideally trying to be as close to one-to-one compatible with regular Ruby as possible. JRuby 1.6 was released with pretty good Ruby 1.9.2 support. JRuby 1.7 will have much more solid 1.9.3 support. We also still have very solid 1.8.7 support. So if you've got stuff that still depends on 1.8.7, throw it away. We use exactly the same memory and threading model as the JVM. We don't change anything, and we get everything else for free. And we do eventually JIT compile Ruby code into JVM bytecode, which the JVM can then take and turn into native code. And this is what I do. I sit and watch this process all day, make sure it's optimizing the right way, and then try to figure out ways to make it better. And so that's really all there is to it, right? That's what JRuby is, and here we are. Well, it's a much longer road than that, unfortunately. Ruby is a challenge to optimize. Nobody has found all the magic solutions to make Ruby fast in all cases. We've done a lot of work over the past five to six years in JRuby to make it run Ruby code as fast as possible, but there are still things we haven't figured out, still things about Ruby that defy optimization, at least with what we know today. There's getting the interpreter to run faster; getting code to JVM bytecode as fast as possible and making sure that bytecode is well optimized; making sure the JVM then takes that bytecode and optimizes it even better down into native code. And that's actually a small part of working on JRuby. The rest is making sure that all of the String methods and Array methods and Hash methods are as fast as possible and don't have obvious, egregious performance bugs in them.
And then it's just repeating this process over years. Here's the commit graph off of GitHub for JRuby, going all the way back to 2001, when the first couple of commits came in from the original contributors. You see there's a little burst of activity at the beginning; they got some basic things working, and then it was probably incredibly slow at that point. They probably weren't too excited about continuing on with it, and it sat dormant for a while. Then my co-conspirator, Tom Enebo, got involved in 2003, 2004, and he started working on it. Then around 2005 or so, I got involved, started rewriting the interpreter, trying new ways of doing stuff. And in late 2006 things really started to pick up: that's when Tom and I started to work on JRuby full-time. We went to Sun Microsystems, which was very interested in JVM languages at the time, and started to work on things full-time there. In 2008 we had a reasonably good interpreter, and I started working on the compiler to turn Ruby code into JVM bytecode, continuing all the way up to today. The activity level has stayed pretty much constant or rising over that time. And this is all just continuing to find better ways to run Ruby code, get Ruby 1.9.3 and other features working well, and deliver the best Ruby possible. But this is a long process: literally six years for the big part of this work, and there was work done before that too. So our goal, usually, is to try to align Ruby execution with what the JVM wants to see. We want Ruby arguments to just be JVM arguments; we don't want some separate structure that we have to pass them around in. We want Ruby local variables to just be JVM local variables, so that when the JVM optimizes locals down to registers, we actually get that for free.
We want to avoid all the extra framing and method information that's kept off-stack in some other data structure, and avoid all the between-call nonsense, looking up the method repeatedly. We obviously don't want to do a hash lookup every single time, and God forbid we do a hash lookup across an entire hierarchy of classes every single time. We want to eliminate all that extra overhead. The bottom line, the golden rule of optimization, is eliminating unnecessary work as much as possible. And what unnecessary work do we have in Ruby? Well, there's a lot we could be wasting our time on. Every module or class is basically a map, or a set of maps: a map from the name of a method to the body of it, the code that goes along with it; a map from the name of a constant to whatever value it has; class variables, a very similar structure to how methods work in the hierarchy. Instance variables traditionally have been implemented as just a map on every object. Newer implementations like 1.9 and JRuby have ways to avoid doing a hash lookup every single time you go into that object. All this stuff is basically wasted cycles. If we can find a good way to cache it, we have a good way to optimize it, so we're not constantly hitting a table somewhere to do a lookup. So for method lookup, what do we do to optimize this? Within each class or module there's the map that lists the methods defined at that level in the hierarchy. Methods are retrieved from that class or its ancestors by just walking up the hierarchy until we find that name, and then we're done; we don't go any further, because that definition obscures any methods above it. In JRuby we've got a serial number at that point that says: I am at this version of the full class hierarchy, and I'm caching this method. If any of these classes changes, this class or any ancestor above it, throw this away, we need to do another lookup. And we do that by having a weak list of all the children.
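The serial-number scheme just described can be sketched in plain Ruby. This is a hypothetical model, not JRuby's actual classes: each class carries a serial number, any definition change bumps the serial down through the (here, strongly held) child list, and a call site reuses its cached method only while the serial it recorded is still current.

```ruby
# Hypothetical model of JRuby's serial-number method caching.
class RClass
  attr_reader :methods, :parent, :children
  attr_accessor :serial

  def initialize(parent = nil)
    @parent   = parent
    @methods  = {}
    @serial   = 0
    @children = []                    # stands in for JRuby's weak child list
    parent.children << self if parent
  end

  def define_rmethod(name, &body)
    @methods[name] = body
    bump!                             # changing a class flushes caches below it
  end

  def bump!
    @serial += 1
    @children.each(&:bump!)
  end

  def lookup(name)                    # slow path: walk up the hierarchy
    @methods[name] || (@parent && @parent.lookup(name))
  end
end

class CallSite
  def initialize(name)
    @name = name
  end

  def call(klass)
    if @cached.nil? || @serial != klass.serial
      @cached = klass.lookup(@name)   # one hierarchy walk, then cached
      @serial = klass.serial
    end
    @cached.call                      # fast path: direct invoke
  end
end
```

Redefining a method anywhere above the receiver bumps the serials below it, so the next call through the site falls back to the slow path exactly once.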
So we have hard links going up the hierarchy and weak links going down. Any time a method anywhere in the system changes, we can tell all the classes below it that they need to flush their caches: something's changed, you need to look up the methods again. Graphically, here's how this actually works. We look up a method, say we're calling to_s on the class at the bottom of the hierarchy here. We'll walk all the way up to the class where to_s is actually implemented, and now we've got that method in hand. This is what we want to avoid doing every single time; we don't want to walk this hierarchy on every call. So that to_s method gets pulled down to the bottom and cached within the receiver's class. At that level, we know the method is there, and we only took the lookup hit once to get it. Then, if anything changes up at the top, say we add a new to_s, we reimplement it, we go down the hierarchy, flush everything out, and our cached to_s goes away; we've got to do another lookup the next time. What about constant lookup? Constant lookups are a little more complicated as far as how you find them. They can be found within a class hierarchy or based on lexical scoping, on which modules surround the piece of code. Because of that, we only have one global switch that says whether constants have been updated somewhere. It's too difficult for us to have a weak structure pointing back to every place a constant might be accessed. So we cache the constant at the point where you use it, along with a serial number that says what version the constants in the system were at, and whenever they change, we have to flush that out. Instance variables, the JRuby way: each class actually just holds a table of offsets into the object.
Rather than going to the object and asking for the value of foo out of some hash table, we go to the class and ask: at what offset into the object is the foo variable stored? And then we can save that offset for next time. As long as we're still accessing the same class, we can go straight in; we don't have to do a hash lookup. And that saves us a lot of overhead that is typically there in most Ruby implementations. So the bottom line of optimizing Ruby is making these things fast, along with closures, which we're still trying to improve: making calls as fast as possible; making constants ideally free, so that once you've looked one up you never pay any more cost again; making instance variables as cheap as possible, just indexing into memory somewhere. And that's where a lot of the invokedynamic stuff comes in. So what is invokedynamic? Is it about invocation? That's the obvious thing, but it's not the only thing. We can use it for doing fast invocation, but there are many other uses, as you'll see here. We're using it for all sorts of aspects of Ruby that have nothing to do with method calls. What about dynamic? Well, dynamic typing is a common reason for using invokedynamic, but things like instance variables, which are just a growing list, or constants, which are essentially lazily defined rather than static constant values in the system, aren't really about dynamic dispatch or dynamic typing. They're just things we need to be able to do at runtime rather than all at compile time. So, a little JVM 101. How many people have used the JVM, run Java applications, anything like that? Okay, so most folks have touched the JVM at some point. There are about 200 opcodes currently in the JVM, and around 16 of them are what I would call data endpoints.
That's things like invocation: the different types of method calls for virtual methods, interface methods, static methods, and then super or constructor calls. Field access: getting data out of an object, or out of some static field somewhere. And then getting data in and out of arrays. Pretty much all Java code just revolves around these data endpoints. Everything else is stack juggling, basic math, and flow control. But at the end of the day, if you're going to accomplish anything, you're going to use one of these operations to put data somewhere, get data from somewhere, or make a method call of some kind. So the problem here is that this is our little bubble of what we can do on the JVM with the available operations. We've got our basic data endpoints, and we've got all these other bits and pieces that help wire code together. And unfortunately, if you ever stray outside of that line, outside of what the JVM can do, you're stuck. You basically have to use only these features of the runtime to implement a language or a library. You're kind of just stuck inside there, and you have to back off. And it's frustrating to me sometimes. You can look at a runtime like Parrot, which had 10,000 operations or something; they just kept adding new ones. Parrot was the original plan for the Perl 6 VM, but they also wanted it to be the ultimate dynamic language VM, and it had thousands and thousands of operations. So why doesn't the JVM just have millions of opcodes that can do all these other things we want, like dynamic dispatch, like lazy constants, and all that? Well, the thing is, with invokedynamic we can actually get around a lot of this. We generate code with invokedynamic in it, and at that point the JVM just asks us what to do. At runtime, it bootstraps our logic, rather than going to one of the standard JVM operations.
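That bootstrap loop can be modeled in plain Ruby. This is only a simulation of the idea; real bootstrap methods are Java code returning JVM method handles, and the names here are made up. The first call through the site runs our lookup logic once, and every call after that goes straight to the bound target.

```ruby
# Simulation of an invokedynamic call site; the real thing is Java code
# returning a java.lang.invoke.MethodHandle, not a Ruby lambda.
class IndyCallSite
  def initialize(&bootstrap)
    @bootstrap = bootstrap            # our language's lookup logic
    @target = nil
  end

  def invoke(receiver, *args)
    # First call: the "JVM" asks our bootstrap how to bind this site.
    @target ||= @bootstrap.call(receiver)
    # Later calls: straight to the target; the bootstrap never runs again.
    @target.call(receiver, *args)
  end
end

# Bootstrap: find to_s once and hand back a callable "method handle".
to_s_site = IndyCallSite.new do |recv|
  meth = recv.class.instance_method(:to_s)
  ->(r, *a) { meth.bind(r).call(*a) }
end
```

After the first `to_s_site.invoke(42)`, the site holds the bound target directly, which is the point at which the JVM can start optimizing and inlining through it like any statically linked call.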
A diagram helps to show how this actually works. So we have our little switchboard, which is essentially the JVM with all of our code in it. We've got our invokedynamic up at the top: we're making a call, say we're calling a to_s method in Ruby, and we insert our invokedynamic instruction into the bytecode at that point. That goes to what's called a bootstrap method, in invokedynamic terms: a piece of code we've defined in JRuby that says how to find the to_s method, how to look it up and make the call. Our bootstrap method then returns a handle to that method. We've got the to_s code in hand, we give it back to the JVM and say: okay, this is what you're actually looking for at this point. The JVM can then go right to that target piece of code and make the call, and the magic of it is that after we've made this loop once, all of that machinery can just go away. It all disappears. We have a direct binding, at the JVM level, from our call to the target method. And then it can optimize it in exactly the same way as any other JVM language, Java, Scala, whatever else, but for Ruby. Alright, a little more detailed example here. Here we have dynamic invocation: how are we doing it on the JVM with invokedynamic? We've got our call to foo, and that basically goes to the JVM level and says, I need to do a foo call on this object. The JVM then uses our logic, the JRuby code, to go into the method table, find the foo method, and grab it and hold it in hand. We'll make that call, actually invoke the code, and then the JVM binds it back into the call site for our foo call and saves it for us, so we don't have to do this again. Now, constants work largely the same way. We've got our VM operations and we've got our call site, which is just accessing a constant. We go to the JVM and say, I'm looking up this constant, figure out how to find it. The JVM calls back to JRuby and says, how do we do this?
What's your logic for looking up this particular named constant? We get that value and we say: okay, here's the value that you need, and here's how you can validate whether it's ever going to be stale in the future. That value then goes all the way back and gets inserted into, basically, the native code at that point, and it's never looked up again. Maybe, if you're doing something silly like defining a lot of constants at runtime, it might come back, but it's essentially free at that point. I actually had a benchmark where I would access the same constant a hundred times in a tight loop. Before invokedynamic, before we could tell the JVM exactly how to bind that constant in and make it permanent, that was a great benchmark: it showed me how much cost was involved in looking up a constant. Once the invokedynamic stuff got in place, it became a completely worthless benchmark, because the JVM would see: okay, we've bound this exact same static value into the code a hundred times, I'm just going to throw 99 of those away and only use the last one, because the other values aren't ever even touched. It actually can optimize it down to a real constant access, which I don't think any other Ruby VM can do at this point. So, instance variables. Instance variables are kind of interesting. We've got our VM operations, and we've got the site where we actually access the instance variable. We go to the JVM, and the JVM asks us: in your offset table, where is the bar entry in this object? And here we've got our little table: foo is at zero, bar is at one. We can take that back to the call site, and as long as we're always accessing the same class of object, we just say, give me the variable at that offset in the table for this object, every time. It's just one or two hops dereferencing memory, rather than doing the full hash lookup that we would do in, for example, Ruby 1.8.7.
We access the object, all of that machinery goes away, and it's basically free from then on: two memory dereferences at most. So invokedynamic basically lets JRuby teach the JVM how Ruby works, so that it can optimize Ruby like anything else, like any other language on the JVM. It lets us work around those 16 or so operations we were limited to before: we can do everything they can do, and a whole lot more. Alright, so how do we know that we've succeeded in optimizing Ruby? How do we know we're getting closer to the goal of high-performance Ruby? Well, obviously we can do benchmarking, which sometimes works, sometimes doesn't. We can monitor what the JVM is doing and see whether the code is optimizing the right way. And we also count a lot on user reports: people trying things out, letting us know whether it's slow or fast, whether it improves their case or not. So first of all, benchmarking. Benchmarking is really hard. It's especially hard on an optimizing system like the JVM, or any of the other optimizing runtimes, because performance changes over time. Do you take the first 10 results? Do you take the last 10 of a thousand results? Do you ever know exactly which ones are important? And as I mentioned with that benchmark of accessing the same constant multiple times: if you get really good at optimizing code, your benchmark may become totally useless. It may optimize completely away, and you end up with a big zero for the performance number, which is interesting, but doesn't actually tell you about a real application and whether it's going to optimize the way you want it to. So really, the problem here is that small systems, like you'd have in a benchmark, are completely different from large systems. Different memory patterns, different code patterns that they execute. The shape of the system is just completely different, and so benchmarks can obscure what the actual performance is, or lack thereof.
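One way to see the "which results do you keep?" problem concretely is to report every iteration instead of a single aggregate number. Here's a minimal harness using only Ruby's standard Benchmark module (the workload and counts are arbitrary stand-ins):

```ruby
require 'benchmark'

# Print each iteration's time separately; on an optimizing runtime like
# JRuby the early iterations are typically much slower than the later,
# JIT-compiled ones, while a non-optimizing interpreter stays roughly flat.
def bench_iterations(label, iterations: 5, inner: 1_000_000)
  iterations.times do |i|
    time = Benchmark.realtime do
      inner.times { "x".to_s }        # stand-in workload
    end
    puts format("%s iteration %d: %.4fs", label, i + 1, time)
  end
end

bench_iterations("to_s")
```

Whether you average all five lines, drop the first two as warm-up, or keep only the best one changes the number you publish, which is exactly the ambiguity being described.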
Alright, so let's look at an example of bad benchmarking in action. We're going to benchmark how much it costs to call a basically empty method: the method just returns self, there's absolutely no work of any kind, and we'll see how fast it runs. This is a favorite of new Ruby language implementers: let's see how fast we can call an empty method; if we do that really well, we've solved the Ruby problem. Alright. So here are numbers comparing Ruby 1.9.3, JRuby, and JRuby with invokedynamic. And it's like, oh my gosh, it's so much faster, right? But what is actually happening here? Is this actually a useful benchmark at this point? Let's go through some observations and see what we actually have. First of all, one slow runtime really screws up the scaling of the chart. You can't actually tell what the JRuby performance is here, because it's so much faster than Ruby 1.9. And I actually... this is kind of mean, but... I did apologize for this afterwards, though. But that's the thing: when you're doing these comparisons with another runtime, sometimes you have to find a different way of looking at it. Raw time doesn't mean as much as some normalized measure. And of course, we're benchmarking an empty piece of code here. How often does somebody publish a new server performance number, for Unicorn or any other Ruby server, that's basically serving an empty request? Shouldn't that just be zero? Shouldn't that be zero on all the runtimes? It doesn't actually mean anything. We can call empty methods really fast, but how many people have applications written entirely out of empty methods? Not many. The other thing you might notice is that invokedynamic doesn't actually seem to do a whole lot for us here. It's cutting the time down, but we're already pretty fast. So we have to go back to JVM Optimization 101 here. The JVM is going to compile code after about 10,000 calls, most of the time.
If a method doesn't reach 10,000 calls, generally the JVM is not going to JIT-compile it at all. It'll inline up to two different target methods at a given call site, normally; with the invokedynamic stuff that's configurable, but still only a handful of targets, because usually most call sites will only ever see a single method. It's also optimistic, so it makes very aggressive decisions about how it optimizes stuff, decisions that may turn out to be wrong later. And this is a key reason why benchmarking a small, contrived, synthetic algorithm can be very, very different from benchmarking a large application: it'll make those optimistic decisions for the small piece of code and never be proven wrong, whereas it might be in a larger system, where there are more methods and more classes. So, again, small code is very different from large code. If you're going to test performance, you want to be testing something substantial; ideally, you want to be testing what you're actually going to run in production. Don't take these benchmarks as proof that it's going to be fast every time. So, back to optimization: inlining. The key optimization the JVM gives you is inlining one method into another. Basically, it takes a call site, like our to_s call here, takes the code that goes along with it, sticks them together in the same ball of code internally, figures out how to optimize that as well as possible as a single unit, and then generates machine code for it, rather than generating this piece of code on the left that does the to_s call and branches off into memory somewhere for the to_s on the other side.
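To make the inlining idea concrete, here's a hedged sketch in Ruby of what the JIT is effectively doing with that to_s example. (The JVM does this on its internal representation, not on Ruby source, and the method names here are illustrative, not from the talk.)

```ruby
# What the source says: every call dispatches dynamically to #to_s.
def describe(obj)
  "value: " + obj.to_s
end

# What the JIT can effectively produce once it has observed that the
# receiver here is always a String: String#to_s just returns self, so
# after inlining, the dispatch and its overhead disappear, and the
# caller and callee get optimized together as a single unit.
def describe_after_inlining(str)
  "value: " + str
end

# A guard remains behind the scenes: if a non-String ever shows up,
# the optimized code is thrown away (deoptimized) and the call goes
# back through the slow, fully dynamic path.
```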
So it treats them basically as though to_s lived in the original method. We avoid the call overhead, we avoid the branch into memory, which throws off the CPU a little bit, and we can optimize away things like arguments that are passed in but never used, and then maybe the values that were calculated for those arguments, so those calculations don't even need to happen. Things along those lines, which we can do more of when we have more visibility into the system, and that's what inlining gets us. And I mentioned optimistic: say we have a system and the only method it calls, or the key method it calls, is foo, and that's it. That's what we're benchmarking and optimizing. Well, all dynamic calls in that small world basically are foo, so every dynamic call must be foo forever, and the JVM will just optimize as if it's always foo. That's kind of what happens in this case: it optimizes for the only case it ever actually sees. So let's try and skew this; we'll play with the JVM a little bit. Before the actual benchmark here, I've got a couple of other methods, and we do a bunch of calls to those other methods, dynamically, basically in the same way. And then you actually start to see some of the effects of how large systems differ from small systems. So here we have bench 1 and bench 2; the top line is the second benchmark, where we're throwing the JVM a wrench, and you can see that it skews the performance considerably. It goes from about 0.37 up to about 0.53. Now we're actually seeing what a real system would look like as far as optimization goes. We have to run these sorts of benchmarks; we have to make the benchmarks more complicated, or make them match the code we actually run. At the bottom you can also see that even the invokedynamic version degrades a little bit. We're doing a better job.
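A sketch of that "throw the JVM a wrench" experiment (my own reconstruction, not the talk's exact code): exercise some other dynamic calls before timing the benchmark, so the runtime sees a less monomorphic world. On MRI the two times will look similar; the skew shows up on optimizing runtimes like JRuby or Rubinius.

```ruby
require 'benchmark'

class Bench
  def foo;  self; end
  def bar1; self; end
  def bar2; self; end
end

obj = Bench.new

# bench 1: nothing but #foo. In this tiny world every dynamic call
# really is foo, so the runtime can optimize as if that's true forever.
t1 = Benchmark.realtime { 2_000_000.times { obj.foo } }

# bench 2: first do a bunch of calls to other methods, then time #foo
# again. With more call targets observed, the optimizer makes less
# aggressive assumptions, closer to how a large system behaves.
1_000_000.times { obj.bar1; obj.bar2 }
t2 = Benchmark.realtime { 2_000_000.times { obj.foo } }

puts "foo alone: #{t1}s, foo after other calls: #{t2}s"
```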
We're giving the JVM a better view of what's actually happening in the foo loop versus the other loop, but it still degrades a bit. And this isn't just JRuby; this is common to any optimizing runtime. Here we have Rubinius numbers: the top two lines are Rubinius, and you can see how much that additional code skews the benchmark in Rubinius as well. It changes the way Rubinius optimizes the actual benchmark, and that's probably closer to what a real system would look like. All right. So, to narrate what happened: we optimized that earlier loop, we optimized all those bar1 and bar2 calls, and then when we got to foo the runtime said, okay, now there's this completely different dynamic call, and we've already seen two other dynamic calls; I'm giving up at this point, I'm just going to make them all raw calls into memory, I'm not going to do the same optimizations anymore. So the assumptions change, and the performance ends up looking different. And that's why benchmarking is not always enough. I run benchmarks all the time, but at the end of the day I have to actually look and see what the JVM is doing as far as optimization. We can look at whether it's compiling code, first of all; we can look at whether it's inlining code together, so we get a better optimization picture; and then you actually do have to sit and read the code it generates at the end. I've spent a lot of time looking at the assembly code that the JVM spits out for Ruby code to figure out how to make it faster. This next one is kind of difficult to read, but all you really need to see is that we've got the foo method here and this "file" method up at the top; I've got these in the reverse order there.
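For the record, the standard HotSpot flags for this kind of inspection look like the following (passed through to the JVM with JRuby's -J prefix; the script name is a placeholder, and PrintAssembly additionally requires the hsdis disassembler plugin to be installed in the JVM):

```shell
# Is the JVM JIT-compiling our Ruby code at all?
jruby -J-XX:+PrintCompilation bench.rb

# Is it inlining methods together? (a diagnostic option,
# so it has to be unlocked first)
jruby -J-XX:+UnlockDiagnosticVMOptions -J-XX:+PrintInlining bench.rb

# What machine code does it actually generate in the end?
jruby -J-XX:+UnlockDiagnosticVMOptions -J-XX:+PrintAssembly bench.rb
```

These three correspond directly to the three levels the talk describes: compilation, inlining, and the final generated assembly.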
So the "file" method at the top is basically the root of the script, and specifically this is the zeroth block within that script. This is the JVM's compilation log, its inlining log, showing that, along with all of the invokedynamic crud in between, it inlines all the way into foo there and treats it as one unit, so this will all optimize down as if it had been one piece of code to begin with. Now, everybody knows assembly, right? Here we're actually decoding the code the JVM generates for the foo method, which is basically a blank method that just returns self. There's our foo method; that's the mangled name we use when we generate it on the JVM. And here's the actual meat of it, the actual assembly code: we do some stack-pointer manipulation; the move of rcx to rax is basically moving the self that was passed into this call into the return register; we do a little more stack cleanup; the test there is asking the JVM whether anything needs to be done, like GC or deoptimization or whatever else; and then we return. I read this to figure out whether the JVM is optimizing things right: if I saw anything in here beyond moving rcx to rax, something would be wrong, because all this method should do is return the self that was passed in. All right, so let's go with a slightly different version of this, one that again starts to get closer to a real system. Now, instead of having one loop at the top level of the script, we're going to have an inner loop within an invoker method that we call multiple times from the outside. So we're getting a few more layers, a little more complexity, the same amount of work, trying to represent a larger system that's doing method calls at multiple levels. So we run this one, and the results are kind of surprising. The blue line is the first bench, our kind of wrong bench with invokedynamic; the second one was the one where we threw the JVM off a bit; and the third one is where we're
splitting things up into separate methods, giving the inner code a little more time to optimize, calling it enough times that all the JVM optimizations will fire. And that's the best performance yet. So now the numbers have swung back the other direction again, and this is why it's so difficult to get reasonable benchmark results from synthetic algorithms like this. These benchmarks are synthetic; the next few slides have more interesting, more real-world benchmarks. But please don't run things like empty-method benchmarks, and please don't run fib, which I'm definitely guilty of doing a lot. Figure out what it is in your system that needs to be faster, and benchmark that. All right, so the first one here is a pure-Ruby red-black tree implementation, just using regular Ruby instance variables and accessors and whatnot. It builds up a 100,000-node tree of random numbers, deletes them all, builds it again, searches for particular elements, walks it in different ways; it's exercising a simple data structure written entirely in Ruby, the kind of thing where, if it wasn't fast enough, you might have had to drop to C. This is the kind of stuff we want to make fast. Here are the numbers: Ruby 1.9 runs this in about 4 to 4.5 seconds or so, and the red line is JRuby before invokedynamic, also on Java 7. But the really good numbers, the JRuby-plus-invokedynamic numbers down at the bottom, end up in the 0.8-second range, considerably faster than any other runtime. I think Rubinius comes close, at maybe 1 second or so, maybe 0.95, something like that. And there's a lot more we can do; there are all sorts of cases in this code that are not optimizing as well as they could. But you're already looking at several times faster, 3 or 4, maybe 5 times faster than Ruby 1.9.3. All right, so math is another big one we try to optimize. I've got two little fractal generators here. One is basically a Mandelbrot generator; it has some integer loops that do the
iteration process and then a lot of floating-point math. The other one is just one I thought was fun: a Julia set generator that uses Ruby flip-flops. How many people have ever seen a Ruby flip-flop? It's actually a syntactic structure in Ruby. How many of you have ever used a flip-flop for anything? There's a couple here. If you're used to sed or awk, that's kind of where they come from. You'll see the code in a minute; I don't fully understand what it's actually doing, but it generates beautiful fractals, so it's a really fun benchmark to run. All right, so there's the output from the Mandelbrot generator, nothing too exciting. Here's the flip-flop-based fractal generator, and I have no idea what this is doing down here. I was looking for a flip-flop benchmark at one point, found this, and thought, wow, that takes the cake right there. The basic idea with a flip-flop is that it switches on when the condition on the left side of the dot-dot-dot becomes true, stays on, and switches off again after the condition on the right side becomes true. So you can use it to switch things on and off: if you're parsing a file and hit the start of a comment, you switch into comment mode for a while, then switch off again when the comment ends. There are probably better ways to do it, or at least more readable ways, but I have no idea what this particular code is doing. It's exciting, though, and it generates really cool stuff. It actually generates the fractal iteratively, so it builds out from one side and sort of crawls across the screen five times; I'll run it at the end so you can see it. And then here are the numbers for the first fractal benchmark, and the thing you'll notice is that the numbers with and without invokedynamic are basically identical. That's because, by and large, we've spent a lot of time optimizing math in JRuby itself, and the logic isn't really that different with invokedynamic in play. Same
thing with the Julia set result: much faster than 1.9, but invokedynamic is not doing a whole lot for us here. It really helps with object access, instance variables, constants, things like that, and with anything that has a lot of method calls that aren't math. So what about Rails? A lot of Rails people here, obviously. Rails is still kind of a mixed bag at this point. There's a lot of code in Rails, a lot of work to optimize it, and it's a long tail. It depends more on optimizing core classes, like making sure all the encoding logic for Ruby 1.9 is as fast as possible. There are some significant gains for some people who run on JRuby; depending on the size of the application and on some JVM settings, you can definitely see big improvements in performance. But it's work in progress. So what's next for high-performance Ruby in JRuby? We want to continue to expand where we optimize. Right now, if you do mismatched-arity calls or rest-arg calls, we don't optimize those very well. (And obviously I missed something when I read through these, right? I got it, thanks.) Super calls we don't optimize right now. Closures, as I mentioned, we're still working to optimize; there's work to be done there. And then, once we've got all the basic calls working well, it's kind of up to you guys and how you write Ruby code, what you do with it. define_method is something we really should look at optimizing; it's used heavily within Rails and a lot of other frameworks. method_missing: do people do a lot of call-throughs that never define a real method for the next time, so you're constantly hitting method_missing? We might be able to find a way to optimize that as well, make it forcibly inline, make it almost as fast as a regular call. It will take work, though. respond_to? we do a little bit of optimization for right now, with certain kinds of caching, so that if you're calling respond_to? with the same
symbol every time, and respond_to? hasn't been overridden, we can just give you a true value back immediately. Proc tables: applications that use tables of procs and dispatch that way. So there's a lot that's possible, but we need to hear from you whether these are worth the time to optimize. All right, so, wrapping up: the future. JRuby is going to continue to get faster. We've taken a few steps back in 1.7, not back from performance but back from working on it, to get 1.9 support and functionality out there, and we'll return to performance soon. There are going to be a lot more improvements to invokedynamic at the JVM level, too. Those guys are using dynamic languages like JRuby and JavaScript as their use case, their test case, for optimizing invokedynamic right now, and I hear from them every week or so with new performance numbers, or with requests for things that need to be optimized. So if we can't yet compete with what the JVM is doing for other languages, well, we're not done yet, and we're going to continue to work on Ruby as much as possible to try to get as fast as all the other JVM languages. And, you know, JRuby and the JVM are still fully free and open source from top to bottom, so give it a try, don't be afraid of it, and let us know what you find out, performance-wise, compatibility-wise, etc. So that's it, thank you. Okay, apparently I timed it exactly: 45 minutes.