Ruby 2.6. First, my name is Kokubun and I'm from Arm. My company was previously Treasure Data, which was acquired by Arm this year, so now I'm from Arm. I'm a Ruby committer: at first I was maintaining the ERB template engine, and now I'm maintaining the JIT compiler. You may not know the JIT compiler's history, so: last year, in April, I experimented with an optimization idea, a JIT compiler, and then immediately after the Ruby 2.5 release last Christmas, I prepared a pull request to merge a JIT compiler overnight, and it was merged early this year. This was good because it left a much longer time to improve the JIT compiler for Ruby 2.6, and I've been improving it for about ten months since then. I'm currently the number-two committer of Ruby 2.6, and I've been working a lot on the JIT compiler. I also received the Ruby Prize this year. Thank you.

So this talk is all about Ruby 2.6, and Ruby 2.6.0 has a JIT compiler. So what's JIT? Do you know it? If you do, raise your hands. Okay, you know a lot about JIT compilers. It's an abbreviation of "just in time" plus "compiler", but that by itself doesn't explain much, so let me explain the history of Ruby's implementation. Historically, Ruby 1.8 parsed the Ruby code into a tree called an abstract syntax tree, for example an expression returning local variable a plus b, and it traversed that tree each time. That's slow if the tree becomes complicated, long, or deep. After that, Koichi introduced the Ruby virtual machine. It compiles that tree into sequential instructions, which is faster than traversing a tree. Not just a little faster: much faster. That's the current implementation as of Ruby 2.5. After that, I did this: in 2.6, the instructions are compiled to native code, which is specific to the machine that is running the Ruby interpreter. It's not that it's less complicated; it needs less calculation.
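The sequential instructions mentioned above can be inspected directly with a standard CRuby API, `RubyVM::InstructionSequence`; this small sketch shows the bytecode the virtual machine interprets for adding two locals:

```ruby
# Compile a snippet and print its YARV bytecode. Reading local
# variables compiles to getlocal instructions, and a + b compiles to
# the specialized opt_plus instruction that the VM dispatches.
code = <<~RUBY
  a = 1
  b = 2
  a + b
RUBY

disasm = RubyVM::InstructionSequence.compile(code).disasm
puts disasm
```

Scanning the output for `getlocal` and `opt_plus` shows exactly the instructions the talk describes the VM dispatching one by one.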
So if we use the virtual machine, it needs to interpret the getlocal instruction and do a lot of work just to get a local variable a, but with native code it just loads the first argument of the method into a register, fetches the second argument into another register, and then runs the add instruction. So it's a lot faster than dispatching VM instructions. This is the JIT compiler.

But how can we use it? This optimization is still experimental, so it's not enabled by default. You need to pass the --jit option to enable the JIT compiler, or, if you are not invoking the ruby command directly, you can put --jit in the RUBYOPT environment variable. So let's use it.

Then we benchmark it. There's a benchmark called Optcarrot, which aims at Ruby 3x3, the goal of making Ruby 3 three times faster than Ruby 2.0. Basically it's an NES (Famicom) emulator that achieves 20 frames per second on Ruby 2.0, and it should reach 60 frames per second, hopefully, in Ruby 3. My machine is a little fast, so it's not exactly 20 FPS by default, but it already becomes 2.5 times faster with the JIT compiler. Thank you. And it's even faster than the previous 2.5: 1.8 times faster. The three black bars are from just this year, and the left six bars are from the previous six years. So this year we made a lot of progress on the Ruby interpreter's performance. We've already achieved 2.5x of the Ruby 3x3 goal; only 0.5x remains.

But how about other benchmarks? That was just a Famicom emulator, and I guess nobody is running a Famicom in production. And I got bug reports that Rails applications and Sidekiq are slowed down by the JIT compiler. So what's happening there? JIT can make things slower, so today's topic is all about the JIT compiler's performance characteristics. There are a lot of tradeoffs, and you may want to care about them in 2.6 since it's still experimental.
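As described above, JIT is opt-in via --jit. A minimal sketch to confirm it from inside a script (the file name is made up; `RubyVM::MJIT.enabled?` is the Ruby 2.6 API and may differ in later versions):

```ruby
# A small check: is MJIT enabled in this process? RubyVM::MJIT is only
# defined on Ruby 2.6+, so the defined? guard keeps this working on
# other versions, where it simply reports false.
#
# Run it with JIT:     ruby --jit check_jit.rb
# or via environment:  RUBYOPT=--jit ruby check_jit.rb
jit_enabled = defined?(RubyVM::MJIT) ? RubyVM::MJIT.enabled? : false
puts "JIT enabled: #{jit_enabled}"
```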
In the future I want you not to care about any of this, but for now we need to care, since it's still experimental. So the first topic is: when does Ruby become slow with the JIT compiler, as in Rails or Sidekiq?

The first case: when there are many JIT-ed methods, it becomes slow. In Sam's earlier report, 10,000 methods were compiled in Rails, because Rails has a lot of methods, while the maximum number of compiled methods is just 1,000 by default. At first I thought that limit was too small for Rails or any serious application, but it turned out to be even too big for computers. Why? It's because the current implementation does JIT compilation with a C compiler plus dynamic loading. The current JIT implementation is called MJIT, which was invented by another person, Vladimir Makarov, but I'm maintaining it. On Ruby 2.6, when --jit is passed, a native thread called the MJIT worker is spawned in addition to the Ruby main thread. It doesn't hold the GVL, so it can run in parallel, and it spawns a GCC process under the Ruby process. So on Ruby 2.6 with the --jit option, you'll see GCC or cc1 or some strange compiler processes under Ruby; yeah, it's intentional, it runs with that kind of architecture. Then it uses the compiled binary: the compiled object file is placed on disk, and the Ruby interpreter dynamically loads it into memory with a function called dlopen, which has some limitations. So let me explain the limitations of dlopen. With this kind of dynamic loading, a native function is loaded like this: the mapping is about two megabytes, which is a little big for computers, and in the middle it has a very large unused space. If three methods are loaded, the heap structure looks like this. So when there are multiple methods, execution has to jump across very large unused spaces to reach each method, and that takes time.
So if we compile 10,000 methods like this, that's 10,000 times two megabytes: very big. It doesn't actually consume that much memory, because it's unused space, but execution has to jump very large distances, so it's slow.

The second case: when there are still methods waiting to be JIT-ed, it also becomes slow. That sounds strange. The first case was that many JIT-ed methods make it slow, and now methods not yet JIT-ed also make it slow. So when can it be fast? Yeah, a very difficult problem.

I solved this problem a little in Ruby 2.6.0 preview 3, with a technique I call "JIT compaction". I just named it that; it's not a formal name. JIT compaction is performed only when the number of JIT-ed methods reaches the max cache limit. When it reaches 1,000 methods, JIT compaction is invoked, and all of these loaded memory regions are compacted into a single two-megabyte region. So there's no large distance anymore, which removes the overhead of jumping across loaded code. There's still unused space, so it's not ideal, but it's much better than loading methods into many separate regions. So I'm solving this issue a little, but not completely: if we load 10,000 methods, it's still slow.

Another problem is CPU and memory resources. If we run a C compiler, it of course uses CPU time and memory. If your computer doesn't have spare resources, that pressures Ruby's resources and Ruby can become slow. So if your computer is not as strong as my benchmark machine, things could be slower than on mine. Also, while GCC or another C compiler is running, we sometimes lock around the GCC invocation or waitpid, since there could be a race condition: if JIT process handling and GCC ran at the same time without that lock, it could cause a segmentation fault.
So we sometimes lock Ruby's main thread around GCC invocation or waitpid, because waitpid waits for processes invoked by the Ruby script, but there are also GCC processes created by the Ruby process. To tell which child was created by the Ruby script and which by the MJIT worker, we track the PIDs of the child processes. So, yeah, it also takes locks. When there are methods waiting to be JIT-ed, there are lock overheads, and memory issues too, because JIT compaction hasn't happened yet.

The last case is when TracePoint is enabled. Have you ever heard of TracePoint? Oh, some of you, yeah. You may not know this one. TracePoint is a dynamic instrumentation feature. It's used by byebug: byebug uses dynamic instrumentation for step-by-step debugging. Also, the web-console gem uses the bindex gem, which enables TracePoint, and so does coverage: if you measure the coverage of your tests, you may already be enabling TracePoint. So those are for development and testing, while currently JIT is designed for production. Also, last year Koichi introduced an optimization that is available only when TracePoint is disabled, even though he is the one who implemented TracePoint; he believes TracePoint is normally not used. Anyway, TracePoint is currently not supported: if TracePoint is enabled, JIT-ed code is disabled for now. In the future it will be supported.

So in summary, there are three situations that may make Ruby slow. Optcarrot is a short benchmark; it runs within only a minute or so, so it doesn't trigger JIT compaction, because it doesn't reach 1,000 methods. It's only pressured by methods still waiting to be JIT-ed. But in the Rails benchmark by Sam, 10,000 methods were compiled, so there was a lot of pressure there. That's the matrix. Still, Optcarrot achieves a very good result, because these downsides are outweighed by the benefits of the JIT compiler. So let's talk about the benefits.
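The TracePoint feature mentioned above looks like this in plain Ruby; while a trace like this is enabled, 2.6 falls back from JIT-ed code:

```ruby
# A TracePoint that instruments every Ruby method call. byebug builds
# step-by-step debugging on this kind of dynamic instrumentation.
calls = []
trace = TracePoint.new(:call) do |tp|
  calls << tp.method_id
end

def greet(name)
  "hello, #{name}"
end

trace.enable { greet("JIT") }   # tracing is active only inside this block
puts calls.inspect
```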
So what is made faster by JIT? Almost all methods become faster with the JIT compiler; this is the silver-bullet part of JIT. Here's why. When the Ruby virtual machine is running, the computer has registers and a native instruction pointer doing the calculation for the Ruby interpreter and the virtual machine itself, and on top of that the Ruby virtual machine has its own stack pointer and program counter. You don't need to remember these; they're just pointers, and they have to be updated to execute instructions like getlocal and send. When getlocal is executed, a local variable is pushed onto the VM stack and the stack pointer moves. So those two pointers are moved by the virtual machine, and since they're pointers in memory, that's memory pressure. We can eliminate that pressure with the JIT compiler: JIT-ed code just moves the native instruction pointer and works in registers.

The second one is basic operators on core classes. There are some methods that are specially optimized by the virtual machine: things like plus, minus, multiply, divide, less-than, less-than-or-equal. Such basic operators are optimized by the virtual machine, and because they're optimized, JIT can easily optimize them further using method inlining. The virtual machine can't inline methods for now, but JIT can easily do it when they're optimized. By inlining `putobject 1` and `putobject 2`, the JIT compiler, or rather the C compiler, can calculate the result of 1 + 2, because these values are inlined in this `three` method: it just returns 3, and there's no add instruction left. The C compiler can inline these kinds of instructions that the virtual machine has optimized. So if your Ruby application uses these kinds of methods, it's likely to be fast.
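The putobject/opt_plus sequence behind the `three` method can be seen with CRuby's standard disassembler:

```ruby
# def three; 1 + 2; end compiles to putobject(1), putobject(2),
# opt_plus. Because opt_plus is a VM-optimized instruction, MJIT can
# emit the equivalent C inline, and the C compiler can then
# constant-fold the whole body down to returning 3.
disasm = RubyVM::InstructionSequence.compile("def three; 1 + 2; end").disasm
puts disasm
```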
The third one is calling a Ruby method, meaning a method written in Ruby. Let me talk about method dispatch in the Ruby virtual machine. If there's a script like `foo.bar`, then prior to reaching this line, we don't know the class of foo or the implementation of bar. At first we need to search for the method from the receiver: get the receiver's class, traverse the class hierarchy, and get the method definition from that tree. After that, once we've found that foo.bar is a method written in Ruby, we switch from the virtual machine back into the virtual machine again, because a Ruby method is evaluated by the virtual machine. So there are some switches, and the method search in particular is slow, because it's complicated and dynamic; Ruby is dynamic. But the virtual machine already has a cache for this: the inline cache. Each call site has an inline cache, and when the line is evaluated a second time, it verifies the cache. If the method has been redefined, it searches for the method again; if not, it just uses the cached value and invokes the Ruby method. With the JIT compiler, we can generate a call that already knows foo.bar is a Ruby method and exactly which method it is. That eliminates many branches, like the switch on the type of the method and the other checks between verifying the cache and reaching the method entry. Since it reduces branches, it doesn't need to rely on branch prediction as much, and it also reduces the memory accesses for reading the method entry: because we know it's the foo.bar method entry, we can inline the pointer to the method entry in the generated code. So it's faster. Similar things can be applied to instance variables.
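The inline-cache behavior for method dispatch described above can be observed in plain Ruby (`Foo`/`bar` are illustrative names, matching the slide's `foo.bar`):

```ruby
# Each call site like foo.bar has an inline cache. The first call does
# the full method search (class hierarchy traversal); later calls only
# verify the cache. Redefining the method invalidates the cache, so the
# next call searches again instead of using a stale entry.
class Foo
  def bar
    :old
  end
end

foo = Foo.new
first = foo.bar    # full search, inline cache filled

class Foo
  def bar          # redefinition: the cached entry no longer verifies
    :new
  end
end

second = foo.bar   # cache miss, method searched again
puts [first, second].inspect
```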
So the last one is instance variables. When there's code like `@foo`, we don't know which address holds the value for @foo, so we need to search for the address, the index within the object's storage on the heap. That search function has many branches; it's complicated and slow. But after we find the index, we can read the instance variable, and the second time around there's an inline cache: if there's a cached entry, it checks whether the receiver is the same class, and if it is, it just reuses, say, index 2 again and fetches the instance variable. In the JIT-compiled case, though, we already know the index is 2, because we execute the JIT compiler just in time. So the index is inlined, we don't need to read the inline cache, and that saves memory accesses. Also, searching for the index has a lot of branches, and a C compiler is confused by many branches when deciding which is the main path; by removing such complicated paths, the generated code gets much faster.

So in summary, there are four types of situations that can become faster. Optcarrot is a peculiar program in that it's extremely instance-variable heavy: it accesses instance variables many times, it has a very hot method called render_pixel that reads a lot of instance variables, and its performance really rides on instance variable performance. So the 2.5x toward Ruby 3x3 is achieved by instance variable access optimization. For Rails, I don't know how often it uses instance variables compared to Optcarrot, but the other optimizations can still help Rails. I do guess that basic operators on core classes are not used so often in Rails: Hash is a core class, but HashWithIndifferentAccess is of course not. The same goes when things are wrapped; String is core, but ActiveSupport::SafeBuffer and things like that are not. So Rails is often using wrappers, or aliases like blank?.
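The instance-variable lookup described above also shows up in the bytecode; a small sketch (`Point` is an illustrative class):

```ruby
# Instance variables live in a per-object table; @x here ends up at one
# index and @y at another. The VM caches the resolved index per call
# site, and JIT-ed code can embed the index directly. The bytecode
# instruction behind reading @x is getinstancevariable.
class Point
  def initialize(x, y)
    @x = x
    @y = y
  end

  def x
    @x
  end
end

disasm = RubyVM::InstructionSequence.of(Point.new(0, 0).method(:x)).disasm
puts disasm
```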
The Ruby virtual machine tracks redefinition of the method named empty?, but it is not tracking a method called blank?. So when we call blank?, even if it delegates to empty?, it doesn't become fast. Rails has some blockers for these optimizations, and so they could not overcome the cost of 10,000 compiled methods. I want to improve that.

So I want to talk about the future of Ruby, because Ruby 2.6 will be fast only on the Famicom. One future idea: the Optcarrot benchmark doesn't create any objects, so it doesn't test memory allocation performance. For optimizing object creation, we can use a technique called escape analysis, which judges whether an object is used by other methods, outside the current method. If it's not used outside the method, we can allocate its memory on the method's frame, and then allocation is just incrementing a stack pointer, and releasing the memory is just decrementing the stack pointer. That's faster than calling malloc and free, so it could be fast. But this year Koichi introduced an optimization called the transient heap, which is for short-lived objects, and it didn't have a big impact on Rails applications, so this may not have much impact either. This one is hard.

But still, I believe there's room for improvement even in Ruby 2.6. I think we can change the heuristics that trigger JIT or compaction: in a Rails application, we could end up JIT-compiling methods that are never used in production, or used only during initialization, which just pressures memory. If we change the heuristics, the logic or strategy, so that we unload JIT-ed methods that are determined to never be used again, we can improve JIT performance. And also there's an optimization in GCC called profile-guided optimization: by passing -fprofile-generate to GCC, we can generate a binary that profiles itself.
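Going back to the escape analysis idea above, here is a sketch of the pattern such an analysis would look for. `Vec` and `norm` are made-up names for illustration; nothing here is part of Ruby's implementation:

```ruby
# Escape analysis would notice that the Vec allocated in norm never
# leaves the method: no reference to it survives the call. A JIT with
# escape analysis could then allocate it on the stack by bumping the
# stack pointer and free it by popping the frame, instead of going
# through the heap allocator (malloc/free).
Vec = Struct.new(:x, :y)

def norm(x, y)
  v = Vec.new(x, y)            # v is only used inside norm...
  Math.sqrt(v.x**2 + v.y**2)   # ...only this Float result escapes
end

puts norm(3, 4)
```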
Once we run that binary, it produces a profile result, and if we feed the profile result back to GCC, GCC can generate a better binary: for example, if there's a function that's not used often, its code can be moved to a rarely-touched area, so the really important code sits close together and memory pressure is reduced. It's good for branch prediction as well. Also, I've already implemented experimental method inlining. With method inlining, 10,000 methods could collapse into just a few methods if the call paths are not too many, so we may be able to reduce the memory pressure by method inlining. I already have a patch for this, so this is possible in Ruby 2.6. Preview 3 of Ruby 2.6 is already released, but the Ruby 2.6 release manager is my colleague, so I asked him until when I can introduce a big change for 2.6, and he said November is fine. We'll release RC1 at the beginning of December, so until then we may be able to do it. So prior to December, we want to experiment and improve this.

To improve it, we need benchmarks. Last week I tweeted this: I want a Ruby benchmark that is made slower by passing the --jit option. And I got many replies. Thank you; I'm a little sad about it, but it's not so bad, because Ruby 2.6 is not released yet. We have one or two months to improve it. So this is very welcome at this moment. At this moment! Don't do this after the Ruby 2.6 release. And if you are considering creating a benchmark for Ruby 3x3, I recommend the benchmark_driver gem, which I created through a Ruby Association Grant project. It's at benchmark-driver/benchmark-driver on GitHub. You probably can't read this slide, but it's an example, and its output is similar to benchmark-ips.
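A benchmark_driver definition is typically a YAML file; here is a minimal sketch (the file name, numbers, and exact CLI flags are illustrative; check the gem's README for your version):

```yaml
# bench_add.yml: measure integer addition
prelude: |
  a = 1
  b = 2
benchmark:
  add: a + b
loop_count: 10000000
```

You would then run it with something like `benchmark-driver bench_add.yml -e ruby -e 'ruby --jit'` to compare the same Ruby with and without JIT.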
Benchmark driver has a lot of plugins to configure how to measure and how to output. The default output looks like benchmark-ips, and a distinctive feature of benchmark_driver is that it can compare multiple Ruby implementations. Maybe you can't read this, but this slide compares Ruby 2.6 versus 2.6 plus JIT versus JRuby and JRuby with invokedynamic. That kind of thing can be done with benchmark_driver, and of course it's not so easy with benchmark-ips or other popular benchmark libraries. So I recommend using benchmark_driver. Also, most of the benchmarks in RubyBench already use benchmark_driver, and since it's pluggable, the output plugin for RubyBench is already there. So if you write a benchmark with benchmark_driver, your benchmark can easily be added to RubyBench. I also have a lot of continuous benchmarks that monitor each commit. Maybe you can't read this at all, but those are many Time methods: the Time class has a lot of methods, and the performance of many of them is monitored. This year I experienced a performance regression from moving to Docker, because Time calls a system call that reads /etc/localtime; I reported it to my colleague, and he optimized the timezone-related parts of Time. So it became faster, and it's being monitored: once we hit some regression, we can find it through this. This is of course also achieved with benchmark_driver, so you can easily add this kind of monitoring using benchmark_driver.

We still have eight minutes, but this is the conclusion. Ruby 2.6's JIT is still experimental; sorry, it's too early to call it Ruby 3x3. There's one option if we can't improve Ruby 2.6 itself: you can set --jit-max-cache smaller, compared to 10,000 methods, to reduce memory pressure and trigger JIT compaction earlier. And we still have time to improve Ruby 2.6, so benchmarks are welcome. Thank you. Questions?
Yeah, so the question was whether this kind of example is real assembly instructions. It's somewhat pseudo, but yes, those are assembly instructions. Correct. That's achieved by using a C compiler for the specific architecture, and it assembles for that specific architecture. And the input to the C compiler is, no, not assembly, C source: I'm just generating C code, and the C compiler abstracts away the architectures.

The question was about strategies to improve performance by utilizing the JIT, right? Are you asking for patterns that make code faster with this JIT? Okay, okay. There could be some ideas to improve performance under JIT, like using local variables instead of instance variables, since instance variables are allocated on the heap and are slower than just using locals on the stack. But I don't want to promote that kind of thing, because it would annoy Ruby users. I want you not to care about it, and instead I'll improve the performance. So all I want is a benchmark for your use case, and I'll optimize it for you.

Sorry; the question was whether the switch is just compile-everything or nothing, or whether we have a way to compile a specific method. In a previous experiment of mine, I implemented a way to specify a method to be JIT-ed, but it would complicate the usage, and basically I don't want users to have to care about the JIT compiler. So I intentionally avoided introducing such a method. It's all or nothing for now; in the future, it will be on by default and optionally disabled, rather than off.

The question was how much overhead the JIT compiler has. The bottleneck of the JIT compiler is the optimization work done by the C compiler. When the MJIT worker generates native code, it transpiles Ruby's bytecode to C code, but that part is not slow.
That takes only sub-milliseconds. Then the C compiler is spawned, and the C compiler does a lot of optimizations, which is the slowest part. It takes roughly 50 milliseconds at minimum and can take about 200 milliseconds on my machine, my fairly strong machine; it could be slower on some weak machines. So on the order of a hundred milliseconds per method is the bottleneck of the JIT compiler.

The question was whether this is only available for GCC. Fortunately, I introduced support for Visual Studio in preview 3, so it can work on Windows with Visual Studio, and it also supports LLVM's Clang, so it can use Clang's optimizations. So currently Ruby supports GCC, Clang, and Visual Studio. I guess most people are not using Visual Studio for Ruby, but there's a Ruby committer who loves Visual Studio a lot; I supported it for him.

The question was whether it really becomes slow because of the memory pressure and locking I explained, or whether the generated code itself could be slower. The answer is that it could be slower, especially around exceptions, internal exceptions. If there's a Ruby block with a break statement inside, break escapes from the block to the Ruby method, and that uses an internal exception. On the virtual machine, that exception can be simulated very easily, because all Ruby methods are evaluated in the same virtual machine loop, but in JIT-ed code we can't escape easily from inside a block to outside the Ruby method. So it could be slow, but usually that's not the bottleneck. That's my intention with this last point, and I may be able to create some heuristics to avoid such slowness in generated code.

The question was whether I'm using many benchmarks other than Optcarrot, I guess. Sometimes I use the Discourse benchmark, which is a Rails application, and which is my main motivation to improve this.
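The break case from the earlier answer, in plain Ruby; the method and values are made up for illustration:

```ruby
# break exits the each call from inside the block, a non-local exit.
# The VM simulates this cheaply because every Ruby frame lives in the
# same interpreter loop, but JIT-ed native frames have to unwind via an
# internal exception, which is one way generated code can be slower.
def first_double_over(limit, numbers)
  numbers.each do |n|
    break n * 2 if n > limit   # non-local exit: each returns n * 2
  end
end

puts first_double_over(10, [3, 8, 12, 20])
```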
And yeah, I'm also running a Rails application in production, so I want to improve Rails application performance. I mainly use Optcarrot and the Discourse benchmark; the Discourse benchmark takes a lot of time to measure, so I can't use many benchmarks, but when I work on an optimization, I mainly test with Optcarrot and the Discourse benchmark. I think it's time. Thank you.