Hello again. The next talk is about JRuby and Truffle and how they can run nine times faster than MRI. It's presented by Peter. He works for Oracle Labs and is one of the maintainers of the concurrent-ruby gem. So please welcome him.

Thank you. So hello. As was mentioned, I work for Oracle Labs, which is a research group within Oracle, where we work on the Graal compiler and the Truffle framework. I'll be talking about what makes TruffleRuby run OptCarrot nine times faster than MRI. We recently changed names from JRuby+Truffle to just TruffleRuby. I would also like to make clear at the beginning that this is a research project, so don't buy any Oracle stock based on anything you hear here today, even if you like it. I'll start by talking a little bit about OptCarrot, what it is. Then I'll explain what TruffleRuby is, and then I'll follow with some of the optimizations we do and how we are able to run OptCarrot nine times faster. So OptCarrot is a NES emulator; NES is short for Nintendo Entertainment System, an old console released in 1983. It has an 8-bit CPU, a picture processing unit (PPU), two kilobytes of RAM and two kilobytes of video RAM. You can look up the benchmark on GitHub. It was created to drive the improvements for Ruby 3, the goal of making MRI three times faster. The benchmark itself runs the Lan Master game. So we'll start by having a look. Even though the benchmark was not developed primarily to play games, you can do it anyway. So we can play Lan Master; this is on MRI 2.4. You can see it's a little bit laggy, so it takes about 40 seconds before I solve the first, simple level. Just bear with me for a few more seconds. Almost there. You can see it's around 13 to 15 frames per second. But if we run the same thing on TruffleRuby, it will be much better: I'll be able to solve the same level in about 20 seconds, because the UI just works much more smoothly. So that's it.
Of course, this is a very subjective way to compare the implementations, but I wanted to show you that you can actually play games on TruffleRuby with OptCarrot. So let's move on to results. These are the results published in the OptCarrot README on GitHub for all the different implementations. They are run for only 180 frames, which is not enough for TruffleRuby to fully warm up, and TruffleRuby is not part of those results. So I'll show you our own results for these four implementations: MRI 2.0, which is the baseline; the latest MRI, 2.4; JRuby with invokedynamic enabled, on a server JVM; and of course TruffleRuby on GraalVM 0.19. I'll be using 6000 frames to let it run much longer. At first I've zoomed in on the first 600 frames. The X axis is frames and the Y axis is frames per second. You can see that the MRI implementations are pretty stable from the beginning; 2.4 is slightly faster. The green dots represent samples from JRuby. You can see that it takes a little while to warm up and then it's stable around 50 frames per second. TruffleRuby has a longer warm-up, but that's because we haven't really looked at warm-up yet, so don't take this as the final state of how fast or slow we warm up. But we don't stop at 50 frames per second; we go up to 110. And if we look at the whole graph for 6000 frames, you can see that after all the optimizations are done, TruffleRuby goes up to 240 frames per second, compared to MRI, which is down around twenty-something. The nine-times comparison is made by taking the last 1000 samples from the previous slide. So let's have a closer look at OptCarrot. If you profile the code, you can see that there is one really hot method, render_pixel in the PPU, the picture processing unit class. So we will be looking at its source code.
There is also a group of methods representing the memory-mode accesses and the instructions in the CPU class, so we'll be looking at those as well. This is the source code for the render_pixel method from the PPU. As you can see, it does a lot of instance variable accesses, some integer operations, and it also reads arrays and appends to arrays. So we will later be looking at how we optimize instance variables, for example. The source code for the CPU is not just one method; it's composed of many methods. There is also a constant, DISPATCH, which is an array of arrays, used to map an opcode, which is just an integer value, to the method name and the arguments of the method that should be called for a given opcode. So for example, if the opcode is one, it looks up the array at position one in DISPATCH and calls it with the send method. It will go to the corresponding op method, and there are two other sends, where the first one prepares the environment for reading from memory in a given mode and the second send executes one of the instructions. Before we talk about some of the optimizations specifically, I will also explain a little bit generally how TruffleRuby works. I forgot to mention at the beginning: if I explain something poorly, please ask immediately; don't wait until the end. So TruffleRuby is a Ruby implementation. We aim to be highly compatible with MRI, which means we will be able to run C extensions, and not just run them, but run them as fast as MRI does. And of course the aim is performance, as you saw in the results. Which also means that there should no longer be any need for writing C extensions, because code you write in Ruby should be almost as fast as if you had written it as a C extension, or in Rust, or whatever.
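As a rough sketch of the dispatch pattern described above (the table contents and method names here are illustrative stand-ins, not OptCarrot's actual code):

```ruby
# A simplified sketch of an opcode dispatch table: each entry maps an
# integer opcode to [method_name, *arguments] and is invoked via send.
class TinyCPU
  DISPATCH = [
    [:op_nop],           # opcode 0
    [:op_add, 2, 3],     # opcode 1
    [:op_store, :a, 7],  # opcode 2
  ].freeze

  attr_reader :regs

  def initialize
    @regs = {}
  end

  def step(opcode)
    # Look up the entry for this opcode and call the named method.
    send(*DISPATCH[opcode])
  end

  def op_nop;           nil;            end
  def op_add(x, y);     x + y;          end
  def op_store(reg, v); @regs[reg] = v; end
end

cpu = TinyCPU.new
cpu.step(1)  # calls op_add(2, 3)
cpu.step(2)  # calls op_store(:a, 7)
```

Because the method to call depends on a runtime integer, every `step` is a fully dynamic `send`, which is exactly what the splitting and caching optimizations later in the talk are aimed at.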
The TruffleRuby implementation is built on Truffle, which is a language implementation framework, a self-optimizing AST interpreter, and I'll explain what that means. It uses the Graal compiler to just-in-time compile the Ruby methods. So I'll start by explaining what an abstract syntax tree is. If you have a simple method foo, I can express it as a tree where at the top is the call to the to_s method. The left branch goes to the receiver, which in this case is a call to the plus method on the receiver 6 with the first argument 7. And the right branch at the top is 8, which is the argument of the to_s method. We can turn this into an interpreter very easily by representing each of the nodes with a class. For example, we start with the literal nodes for the numbers. We can implement one just by creating a literal node class which is initialized with a value; when you call the execute method, where you are interpreting the node, it just returns the value the node was initialized with. Now let's have a look at method calls. Here we have to create a node with a little more information: we need the name, a node representing the receiver, and an array of nodes representing the arguments of the method. We assign those to instance variables, and then, to execute this node, we first execute the receiver node to get the actual receiver object, then we look up the method by name, and then we can call the method with the receiver and with the arguments obtained by executing the argument nodes. But as I said, Truffle is a self-optimizing abstract syntax tree interpreter. Continuing with the simple example, we can for example do node replacement. So we had, what was it called, the method call node.
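A minimal sketch of such an AST interpreter in Ruby (the class names are my own; the real Truffle nodes are Java classes):

```ruby
# A literal node simply stores a value and returns it on execute.
class LiteralNode
  def initialize(value)
    @value = value
  end

  def execute
    @value
  end
end

# A method call node stores the method name, a receiver node, and the
# argument nodes; execute evaluates the children, looks up the method
# on the receiver, and calls it.
class MethodCallNode
  def initialize(name, receiver, arguments)
    @name = name
    @receiver = receiver
    @arguments = arguments
  end

  def execute
    receiver = @receiver.execute
    method = receiver.method(@name)  # lookup happens on every call
    method.call(*@arguments.map(&:execute))
  end
end

# The tree for the example: (6 + 7).to_s(8)
tree = MethodCallNode.new(
  :to_s,
  MethodCallNode.new(:+, LiteralNode.new(6), [LiteralNode.new(7)]),
  [LiteralNode.new(8)]
)
tree.execute  # => "15" (13 written in base 8)
```

Note that the method lookup in `execute` runs on every single call, which is the overhead the self-optimizing rewrites are about to remove.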
So we can do a simple monomorphic cache here by creating an uninitialized method call node which, when it's first executed, looks up the method and then replaces itself with another node, a cached method call node. That one holds not just the name of the method, but already the object representing the method to be called. And after it's replaced, the method is immediately called. If you look at the cached method call node class, you see that in its execute method the method is called immediately; there is no more expensive lookup of the method through the classes and modules of Ruby. This is of course a very simplified example. I won't be explaining all the details, such as how we deal with the method being redefined; of course we handle that, but I'm skipping it. But even though the nodes specialize for the code they are executing, for example by caching, that's not enough to run this fast. We need to be able to compile this somehow. For that we use partial evaluation, which basically eliminates all the overhead of executing the nodes. We do that by evaluating as much as we can ahead of time, using all the constant information from the nodes. So if we look again at the cached method call node and the literal node, and in this example add the attribute final, we will just assume for this example that Ruby has final instance variables, which means that after you set a value to the instance variable it can never be changed, which is important for the partial evaluation. So we represent the code from the previous slide with these nodes and we start partially evaluating it. We start by copying the body of the execute method of the top node, the one executing the to_s method.
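A sketch of that node-replacement inline cache, continuing the toy-interpreter idea (the names are illustrative, and this sketch omits the guards a real cache needs, such as checking the receiver's class and watching for method redefinition):

```ruby
# Before the first call only the method name is known. On first
# execution the node looks the method up, rewrites itself in its
# parent into a cached node, and delegates to it.
class UninitializedCallNode
  def initialize(parent, name)
    @parent = parent
    @name = name
  end

  def execute(receiver, args)
    method = receiver.class.instance_method(@name)  # expensive lookup
    cached = CachedCallNode.new(method)
    @parent.replace_child(self, cached)             # self-optimize: rewrite the tree
    cached.execute(receiver, args)
  end
end

# After the rewrite the resolved method object is stored, so execute
# just binds and calls it -- no lookup through classes and modules.
class CachedCallNode
  def initialize(method)
    @method = method
  end

  def execute(receiver, args)
    @method.bind(receiver).call(*args)
  end
end

# A trivial parent node holding one child call node.
class CallSite
  def initialize(name)
    @child = UninitializedCallNode.new(self, name)
  end

  def replace_child(old, new_node)
    @child = new_node if @child.equal?(old)
  end

  def call(receiver, *args)
    @child.execute(receiver, args)
  end
end

site = CallSite.new(:+)
site.call(6, 7)   # first call: lookup, rewrite, then call => 13
site.call(10, 5)  # later calls hit the cached node directly => 15
```

After the first call the tree has literally changed shape: the slow lookup node is gone and only the direct-call node remains, which is what "self-optimizing AST interpreter" means.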
Now, because we've marked some of the instance variables as final, we know that the method, receiver, and arguments fields are constant values during compilation. So we can expand the arguments to just the array containing the one literal node that holds the value 8. I am using brackets here to represent objects which have already been created; this is not an instantiation of the object, it's a representation of the object that was already stored in the arguments instance variable. Now we can get rid of the array. We can evaluate the execute method on the literal node, and if you remember, that's just a read of the value instance variable inside the literal node, which is constant, so we can replace it with 8. Now we evaluate the receiver. I do a little substitution here to make it easier: again we replace the execute call with the body of the execute method of the cached method call node, and again we replace the argument with 7. The receiver is in this case a literal node for 6, so we replace that with 6; and the method, because this is a cached method call node, was already looked up in the uninitialized node before, so it is just the object representing the method itself, and we can replace it with a direct call to the method on the integer class. We put that back, and the last thing we have to do is the same for the first method, the one read from the instance variable of the top node, to get just two direct calls. So we've eliminated all the execute methods of the nodes and we are left with just the bare minimum of work. This is basically the compilation unit which is then fed to the Graal compiler, and Graal produces highly optimized machine code for it. So next time you call this method, it will not go through the interpreter calling the execute methods on the nodes; it will call the compiled code instead. Okay, so we actually don't write the Ruby nodes in Ruby.
We use Java for that. This is a small example of our actual implementation of the plus operation on Fixnum. We use a DSL a lot, which in Java means that we use annotations and annotation processors heavily. For example, the code on the left just means that for doing a plus operation on a Fixnum we have at least these four specializations, where the first one is used for adding two small integers without overflow. If that fails with an ArithmeticException, then the second specialization is used, which casts the values to long to avoid the overflow. If we are adding two longs, then we again have to check whether we get an ArithmeticException, because that can overflow as well; if it does, we have to create a BigInteger. The way this is implemented is that the annotation processor generates nodes for each of these methods for us. Based on which of the specializations is used, they are added to a chain which is then called; the execute methods are called along the chain, and then it of course goes through partial evaluation to compilation. This is also an example of how we do type specialization: if your code uses only small integers, only the first method will ever be triggered, so it will end up compiled as a single instruction for integer addition plus one jump-on-overflow to check that it didn't overflow. That's it. So that was the basics of Truffle. Now let's talk a little bit about how we optimize instance variable access, because we've seen a lot of it in the source code of the methods of the OptCarrot benchmark. Ruby objects can grow and shrink, which means that a new instance variable can be added to, or removed from, an object at any time. Because of that, we use the following representation.
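A sketch of that specialization chain in Ruby (the real version is Java with Truffle DSL annotations; here the 32-bit and 64-bit bounds and the chain mechanics are simulated by hand rather than relying on hardware overflow flags):

```ruby
INT_MAX  = 2**31 - 1
LONG_MAX = 2**63 - 1

# A chain of specializations for '+', tried in order. Each entry is a
# guard plus an implementation; the first matching guard wins, which
# mimics how the generated Truffle specialization nodes are chained.
ADD_SPECIALIZATIONS = [
  { name: :add_int,    guard: ->(a, b) { (a + b).abs <= INT_MAX },
                       impl:  ->(a, b) { a + b } },  # one machine add + overflow check
  { name: :add_long,   guard: ->(a, b) { (a + b).abs <= LONG_MAX },
                       impl:  ->(a, b) { a + b } },  # widened to 64-bit
  { name: :add_bignum, guard: ->(_a, _b) { true },
                       impl:  ->(a, b) { a + b } },  # arbitrary-precision fallback
].freeze

def specialized_add(a, b)
  spec = ADD_SPECIALIZATIONS.find { |s| s[:guard].call(a, b) }
  [spec[:name], spec[:impl].call(a, b)]
end

specialized_add(6, 7)        # => [:add_int, 13]
specialized_add(INT_MAX, 1)  # => [:add_long, 2**31]
specialized_add(LONG_MAX, 1) # => [:add_bignum, 2**63]
```

The key property is the one from the talk: if a program only ever takes the first branch, the compiler can compile just that branch, and the result is a single add instruction plus an overflow check.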
We have a dynamic object representing a Ruby object, and it has a few fields to store some values of instance variables; it also has an array to store any additional instance variables that don't fit in those fields, and the array can then be grown or shrunk as needed. And we have a companion object, the shape, referenced from each dynamic object. It describes where each instance variable is stored in the dynamic object. So in the shape we can look up that, for example, the name instance variable is stored at a certain offset of the dynamic object, in field two. Because of that, each time we want to read an instance variable, we don't want to go through the shape, looking up where the actual value is and then reading it; we want to cache where we should read the values of the instance variables from. This is again a small example of how it looks in our implementation. If there is a read of an instance variable in the code, the node starts out uninitialized, and the first time it reads something it caches the shape of the object it has seen. It caches the shape and also the property, which stores the offset in the dynamic object where the value is stored, and the shape and the property are final. Switching to a graphical representation: we start with the uninitialized node; then when we read, for example, the instance variable name from an object, we cache that it has a shape with the instance variable name, and the property describing that instance variable is cached as well, and within it is stored the final offset of the value. This means that the next time we read the value, we first just check that the shape of the dynamic object is equal to the cached shape, which is very cheap, and if it is, we can immediately read the value from the final offset stored in the property.
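A sketch of this shape-plus-cache scheme in Ruby (the field layout, class names, and single-entry cache are simplifications of Truffle's object storage model):

```ruby
# A shape maps instance variable names to storage offsets; objects
# with the same set of instance variables share one shape.
class Shape
  attr_reader :offsets

  def initialize(offsets)
    @offsets = offsets.freeze
  end
end

# A dynamic object stores its values in a flat store, interpreted
# through its shape.
class DynamicObject
  attr_reader :shape, :storage

  def initialize(shape, values)
    @shape = shape
    @storage = values
  end
end

# An inline cache for reading one instance variable: on first use it
# caches the shape and the offset; afterwards a read is just a cheap
# identity check on the shape plus an indexed load.
class ReadIvarNode
  def initialize(name)
    @name = name
    @cached_shape = nil
    @cached_offset = nil
  end

  def execute(obj)
    if obj.shape.equal?(@cached_shape)
      obj.storage[@cached_offset]  # fast path: shape check + load
    else
      @cached_shape  = obj.shape   # slow path: consult the shape
      @cached_offset = obj.shape.offsets.fetch(@name)
      obj.storage[@cached_offset]
    end
  end
end

point_shape = Shape.new(name: 0, x: 1, y: 2)
obj = DynamicObject.new(point_shape, ["p1", 3, 4])
reader = ReadIvarNode.new(:x)
reader.execute(obj)  # slow path, fills the cache => 3
reader.execute(obj)  # fast path => 3
```

In compiled code the cached shape and offset become constants, so the whole read collapses to the comparison and the load the IGV graphs in the next section show.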
So we don't have to go through the shape and look up the property and so on. If you have a very simple method that just reads one instance variable, we can then look at a graph of the IR, the internal representation from Graal; this is a screenshot from IGV, and we can verify what I was just trying to explain on the previous slide. At the beginning, these nodes just read the arguments given to the method, and this one is self. If you follow the blue line, which you probably can't read, so I have to explain it: this node reads the shape, and the one underneath compares the shape it just read from the dynamic object with a constant value which is part of the compiled code. If this check succeeds, it goes here, where it reads the value from the self object, and this small gray rectangle is the final value, the offset it reads from. But the real advantage shows up when a method has a lot of instance variable reads, like in this one, which is the zero-page mode for accessing memory from the benchmark: the checks for reading the instance variables are actually merged together. Again, this is just a subset of the IGV graph, and you can see there are many equality checks, all checking that the shape of the object is the one we are expecting. But after one more optimization pass in the Graal compiler, you see it reduces to just three checks for the whole method. The next thing is the splitting optimization, and for that we'll have another look at the source code for the CPU. As you can notice, there are two send calls here in the op method. The send method is, in our implementation, represented as a tree of nodes. If we had just one tree for the send method, then we couldn't specialize for these different cases: we are calling send in different places, and there may be different modes and instructions, so it's calling different methods.
So we want the call to send to specialize differently in these two places. For that we use a two-dimensional polymorphic inline cache. I actually have the example here: the first send in this benchmark is called with six different modes, and the second send, which calls the instructions, with seven different instructions, represented by seven different methods. So we want the cache to specialize the two sends differently. For that we use splitting, which means that we take the original tree representing the body of the send method and copy it, so we have two copies of it: the first one, with its cache, specializes for the first send call site, and the second one for the second send call site. Let's look a little more closely at how this works. At the beginning there is again an uninitialized node inside the send tree, and when it's first called with a method, it inserts one node which checks whether the argument is the absolute access mode. Then it checks that the receiver is of a known type, which in this case is always CPU; this is the second dimension: for another type of receiver we would have another branch here. If these two checks pass, it can directly call the absolute-mode method on CPU. As different methods are called, the tree of the send method grows, caching all the different methods that are called. And this is again the representation from IGV, where you can actually see how it was specialized for the given code; in this case you can almost read the names of the methods and the memory accesses. Splitting is applied to all methods, which is particularly important for core methods like each or to_s, which are called all over the source code, so you want these methods to be specialized at the places where you are calling them.
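A sketch of such a two-dimensional polymorphic cache for send in Ruby (the hash-based cache and the class names are illustrative; Truffle builds this as a chain of guard nodes, one per observed combination, with a size limit):

```ruby
# A per-call-site cache for dynamic send, keyed on the two dimensions
# from the talk: the method name being sent and the receiver's class.
# Splitting gives each send call site its own copy of this cache, so
# the mode send and the instruction send specialize independently.
class SendCallSite
  def initialize
    @cache = {}  # [name, receiver_class] => UnboundMethod
  end

  def call(receiver, name, *args)
    key = [name, receiver.class]
    method = @cache[key] ||= receiver.class.instance_method(name)  # look up once
    method.bind(receiver).call(*args)
  end

  def cached_entries
    @cache.keys
  end
end

# Hypothetical stand-ins for two addressing-mode methods.
class CpuModes
  def abs(addr); 1000 + addr;  end
  def zpg(addr); addr & 0xFF;  end
end

site = SendCallSite.new
cpu = CpuModes.new
site.call(cpu, :abs, 5)    # miss: fills the first cache entry => 1005
site.call(cpu, :zpg, 300)  # miss: second entry in the same cache => 44
site.call(cpu, :abs, 1)    # hit: no lookup => 1001
site.cached_entries.size   # => 2
```

With a branch per cached combination, the compiled code can dispatch on a handful of cheap comparisons instead of a full dynamic method lookup on every send.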
Otherwise you would have one tree per method, and it would become megamorphic very quickly, which means it would not specialize for any particular case; it would be a very generic version handling every place it's called from. This is actually part of the Truffle framework, so it's not something we write directly in the TruffleRuby implementation; it's handled by the framework itself. The second very important optimization is inlining. Now we have split trees which are optimized for the different places from which they are called; but remember how we saw the shape checks for instance variable accesses being merged. We will have, for example, the ORA op method, and it calls one of the mode methods and one of the instruction methods, and both of those also access instance variables on the same object. But if those methods are not inlined, they are part of different compilation units, so the compiler cannot see the checks done in one of the memory-mode access methods together with the checks in one of the instructions, so they cannot be merged or eliminated. For that you need to inline the methods into the caller, so you get a bigger compilation unit and can eliminate these checks and guards. This is done very simply: for, say, the immediate memory-mode access, you just take the tree representing the body of the method and copy it into the caller, and that's basically it. And again, this is part of the Truffle framework, which means any language written on top of Truffle gets these optimizations.
So this means, for example, as I already showed with the type specialization: if we have the plus method on Fixnum, after compilation and inlining it is reduced to just two instructions. But another consequence is that we can inline blocks and eliminate their overhead. Blocks are a very important abstraction in Ruby and are used very often, so it's good that we allow developers to leave them in the code, keeping the abstractions while not actually paying the price in performance. To see that, we compare these simple methods: the first one is the one we've already seen, just a read of an instance variable; the second one does the same thing, but wrapped in a block. These are the IGV graphs before the optimization passes for the two methods, and this is after the optimizations: as you can see, they are the same. The overhead of the block, the allocation of the block, everything was eliminated. So, in conclusion: what makes TruffleRuby run the OptCarrot benchmark nine times faster than MRI? It's not a single optimization; it's several things, and I didn't even cover everything, just the major ones. It's splitting, inlining, and partial evaluation, which together eliminate the overhead of having an AST interpreter, plus a high-quality compiler like Graal, which allows us to produce high-quality machine code. And of course we also do some optimization around array access, which I didn't include in this talk because I thought I wouldn't have the space. In short: just as we are able to specialize for small integers, we are able to specialize arrays, so if we see that some code is storing only small integers into an array, we don't have to allocate an array of objects; we allocate just an array of ints, of primitive values, which also gives us some performance benefits. I would also like to acknowledge all of the people who are working on Graal, Truffle, and
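A sketch of that array storage specialization in Ruby (Ruby arrays don't actually work this way; the strategy names and the promotion rule here just illustrate the idea of picking a compact backing store and widening it on demand):

```ruby
# A specializing array that starts with compact int storage and
# promotes itself to generic object storage only when a value is
# stored that doesn't fit a machine int.
class SpecializingArray
  INT_RANGE = (-2**31..2**31 - 1)

  attr_reader :strategy

  def initialize
    @strategy = :int  # models a primitive int[] backing store
    @storage = []
  end

  def <<(value)
    unless @strategy == :object
      fits = value.is_a?(Integer) && INT_RANGE.cover?(value)
      @strategy = :object unless fits  # promote: reallocate as object[]
    end
    @storage << value
    self
  end

  def [](i)
    @storage[i]
  end
end

a = SpecializingArray.new
a << 1 << 2 << 3
a.strategy  # => :int, no boxed objects needed so far
a << "x"
a.strategy  # => :object, promoted on the first non-int store
```

As long as code like OptCarrot's pixel buffers only ever stores small integers, the array stays in its unboxed representation, avoiding one allocation and one pointer indirection per element.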
TruffleRuby. As I mentioned, this is a research project. And that's it, so thank you for your attention.

We have about 15 minutes for questions. One of the things that interests me about TruffleRuby is: are you going to be able to use most of the existing Java tooling to work with it? For instance, can you use VisualVM with TruffleRuby, or something like that?

The question was whether we are able to use existing tooling for Java, and the answer is yes. For example VisualVM: I think there is some ongoing work to improve it so it won't profile just the Java parts of Truffle and so on, but will actually understand, a little bit, the languages implemented on top of Truffle, so you can use it to inspect those languages, not just the Java. We also have a debugger which is independent of the implementation, so you can debug any language written on top of Truffle, and because the debugger is independent, and because of polyglot support, which means you can easily call from one language implemented on Truffle into another, we can of course debug across calls between different languages in one runtime.

Thank you. Any more questions? Sure. When you're generating your two-dimensional inline cache, what kind of restrictions are you making on whether or not you're actually going to generate specializations, or do you just sort of wait to see how many things go through it?
There is a limit; sorry, I didn't repeat the question: the question was how we deal with limits in the two-dimensional cache. There is a limit value on the annotation for the cache which says how big it can grow. You can configure what happens after it exceeds the limit: you either throw the whole cache away and replace it with one generic node which always knows how to call any method, or you keep the current cache and add the generic node, which handles any other calls, at the bottom.

Do we need Java for running TruffleRuby? Currently you need a JVM with the JVMCI API so you can use Graal. There is a build of Java 8 with it, and Java 9 will have it when it's released, so it won't be a problem with Java 9. But there is also another project, Substrate VM, which basically ahead-of-time compiles the whole implementation of the language, including Truffle and Graal. On one hand you lose the Java part; on the other hand, because it's precompiled, we get the start-up time down to around 100 milliseconds, which is quite close to MRI. So hello world is just 100 milliseconds.

I have a follow-up question: with Substrate VM, one thing I haven't understood is, when you go from the normal Java VM to Substrate, what do you lose?

Because it's ahead-of-time compiled, during compilation it analyzes all of the parts of the Java standard library which are used and compiles only those parts ahead of time, and it also makes certain assumptions: it does a global analysis, so it can see that a method with this name is always called on this particular receiver, and because it's compiled ahead of time, you cannot go back on that assumption. So we have to forbid class loading of new Java classes, which could break those assumptions; for example, on Substrate VM you cannot load new Java classes because of this. That's not entirely true, because there might eventually also be a Java on Truffle, and then it would be able to.
You also lose some of the things from HotSpot, like the GC or something like that? Yeah, right, it's a different VM, so it has a different GC. Other questions? Thank you, Peter.