I'm going to talk about optimization in Ruby. First, let me introduce myself. I'm Shohei. My name is a bit difficult for you to pronounce, but please call me Shohei. I have been one of the Ruby core committers since 2008 or so, I remember. I was the maintainer of the 1.8 series, and I actually created the Ruby repo. But then I was not an active developer for a while, because my job kept me busy. I changed my job last February, so that I can now develop new things. Today I will show you the thing I developed recently.

So let me briefly give an overview of this talk. I implemented what we call a deoptimization engine on Ruby, version 2.4. Under some benchmarks, which I'll show you later, it boosts execution up to 400 times, depending on the benchmark. Also, because this is the first attempt of this kind, it leaves lots of room for future optimizations.

Now, well, I know everyone has something to say about this, but Ruby is at least not the fastest language around. This is a screenshot of the language shootout site, which compares the speed of various languages. The chart is a comparison of several languages. Oh, sorry, I can't show the pointer. Okay, the orange line is Ruby. The fastest ones are on the left and the slower ones are on the right, so it is clear that Ruby is not the fastest language; you can see it is on the right side. What is interesting is that JRuby is placed to the right of Ruby. JRuby is actually faster than Ruby, but it is kind of mistakenly placed on the right side. It's very interesting.

There are many reasons given to explain why Ruby is slow, like because we have a GC, or we have the GVL, or anything like that. But I'd like to say this is all wrong. Ruby is slow because Ruby is not optimized.

This is the actual disassembly of how 1 + 2 is evaluated. It's a bit noisy, but all it does is putobject 1, putobject 2, then send :+. But wait: 1 plus 2 might just be 3, right? The way we evaluate this is too complex. It should be just putobject 3. But it's not. The reason we can't do this is that my claim that "1 plus 2 must be 3" is actually wrong: plus could be redefined, dynamically and globally.
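A minimal illustration of this point, which you can run yourself: Ruby lets you redefine Integer#+ at runtime, so the VM cannot assume that 1 + 2 is always 3.

```ruby
# Integer#+ can be redefined at runtime, so the VM must not
# hard-code 1 + 2 == 3.
class Integer
  alias_method :orig_plus, :+

  def +(other)
    42  # an "evil" redefinition: every addition now answers 42
  end
end

redefined = 1 + 2  # 42, not 3

# Restore the original behavior so the rest of the program works.
class Integer
  alias_method :+, :orig_plus
  remove_method :orig_plus
end

restored = 1 + 2   # back to 3
```

This is exactly the "evil" case the deoptimization engine has to watch for: rare, but legal.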
So it is pretty difficult to make sure beforehand that 1 plus 2 is always 3. There is a reason why we calculate it every time.

That being said, redefinitions are a rare kind of thing. 1 plus 2 is arguably 3 every time; people do not normally break first-grade arithmetic. Of course, redefinitions are part of Ruby's feature set, so they must work if they happen. But the problem is: should they really work fast?

So I'd like to introduce a mechanism called deoptimization. Let's just forget about redefinitions, because they won't happen; and only when they do, we can stop and throw away everything, falling back to naive evaluation. This is additional overhead, so it makes redefinitions slower. But like I said, redefinitions are rare, so it works fast most of the time.

Now, we take a very straightforward approach. We do not compile the instruction sequence into native machine code. We do not introduce a new binary format, or a new type of execution. We are going to optimize the existing instruction sequences into more sophisticated ones, and in doing so, we are going to overwrite the existing instruction sequences on the fly. This means we cannot change the length of a sequence; only modifications that preserve length are possible. In reality we can fill in nops, so shrinking an instruction is kind of possible, but strictly speaking, we can't change the length.

Well, this diagram shows a part of Ruby's internals. In a Ruby process, the VM instructions are pointed to from the iseq_encoded field of struct rb_iseq_constant_body, whose length is iseq_size. Oh, wait, wait. Yeah, this one. Can you see this? On the left side there's a struct rb_iseq_constant_body, and it points to a sequence of instructions. Apart from them, there are program counters somewhere outside of the struct. A program counter typically resides on the machine stack, so it lives apart from the management structure.

Now, in our implementation, two new fields, named iseq_deoptimized and created_at, are introduced. iseq_deoptimized is a simple copy of iseq_encoded as it originally was. We will visit the created_at field later.
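The save-and-restore idea behind iseq_deoptimized can be sketched at the Ruby level like this. This is a toy model with invented names, not CRuby's actual C implementation: keep a pristine copy of the instruction buffer, mutate the live one in place, and copy the pristine one back on demand.

```ruby
# Toy model of the iseq_encoded / iseq_deoptimized pair.
class ToyISeq
  attr_reader :encoded

  def initialize(insns)
    @encoded     = insns        # live buffer, optimized in place
    @deoptimized = insns.dup    # pristine copy, like iseq_deoptimized
  end

  def optimize!(index, new_insn)
    @encoded[index] = new_insn  # same length, in-place overwrite
  end

  def deoptimize!
    # The whole cancellation is one copy back, analogous to memcpy().
    @encoded.replace(@deoptimized)
  end
end

iseq = ToyISeq.new([[:putobject, 1], [:putobject, 2], [:send, :+]])
iseq.optimize!(0, [:putobject, 3])
iseq.optimize!(1, [:nop])
iseq.optimize!(2, [:nop])
# ... an evil redefinition happens ...
iseq.deoptimize!
iseq.encoded  # back to the original sequence
```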
Now, alright. I assume we have some optimizations applied: iseq_encoded got changed from what it originally was. How do we cancel this? It's quite easy: we saved the sequence for exactly this reason, so we just write it back. This is the actual implementation of the deoptimization. Nothing is abbreviated; this is the whole thing. Of course you can see the main procedure is the memcpy in the last two lines.

So what is the advantage of this approach? First, it's expected to be highly portable, because it is written in pure C, and no JIT or native assembly is involved. Also, the deoptimization does not touch the program counter at all. This is particularly important, because we don't have to bother with any VM state; the only thing that should be restored is the instruction sequence. Preparation is done only once, at the beginning. This is also an advantage when we encounter a highly evil situation like, well, tons of redefinitions continuing to happen.

Now, because we have to know when redefinitions happen, we are going to have a new state variable. This is the global vm_state timestamp, which is incremented when something happens: for instance, constant assignments, method redefinitions, and module inclusions. This is the implementation. It's a bit long, so not everything is shown here, but you can read it: we introduce a new state variable, static rb_vm_state or so, and increment it at each such point. It's very straightforward.

This is where the deoptimization happens. The important part is at the bottom, which is a macro named CALL_METHOD. Strictly speaking, we do the check right after the method call. This is because the incrementation of the state variable ultimately happens inside this call. Can you see this? CALL_METHOD. So when the call returns, there is a chance that something happened, and we are going to test that here. Another point where the deoptimization can kick in is vm_push_frame, where a function call's stack frame is made up: a sequence optimized a while ago may have become stale at some point.

A huge advantage of this approach is that it adds almost no overhead. The graph shows a preliminary experiment with all the overheads: the experiment invokes methods many times, which should run the modified path. The graph shows it only adds slight overhead. It does add overhead, but only slightly, within the margin of error.
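A Ruby-level analogue of that global timestamp, with invented names (the real thing is a C variable bumped inside the VM, not a hook like this): use the Module#method_added callback to tick a counter whenever a method is (re)defined, so cached optimizations can check staleness cheaply.

```ruby
# Hypothetical Ruby-level sketch of the vm_state timestamp idea.
$vm_state = 0

module StateBump
  def method_added(name)
    $vm_state += 1  # in CRuby this bump happens in C on redefinition
    super if defined?(super)
  end
end

class Object
  extend StateBump  # every class inheriting from Object now ticks it
end

before = $vm_state
class Foo
  def bar; 1; end   # defining a method increments the timestamp
end
after = $vm_state
# after > before: an optimized sequence comparing its created_at
# against $vm_state would now know it may be stale
```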
Okay, let me summarize what was shown. We introduced a deoptimization engine on Ruby. Its main characteristics include consistency of VM states such as the program counter. As a result, the engine is very lightweight.

Now, we are ready. Let's have optimizations. Like I said, we refrain from touching VM states. Even under such a restriction, it is still possible to apply some kinds of optimizations, like these: elimination of method calls, folding constants, and eliminating variables.

Let's first see constant folding. This is how a sequence changes before and after optimization, in this format. You can see that there are several instructions before: several types of instructions like getinlinecache against a constant, and so on, which were transformed into one putobject and a sequence of nops. Those nops are meaningless; they don't do anything, just fill the blank area of the sequence. So, in effect, it changes several instructions into one putobject.

This is in fact pretty straightforward, because constants are already inline-cached. Can you see this one? getinlinecache and setinlinecache are shown. That inline cache is already storing the resulting constant. So we replace that set of instructions with the already-cached constant.

This is the implementation. Can you see the header? It is the actual implementation of getinlinecache. The complex if condition, you see, is testing whether the cache hits. And if it does, before we jump to the destination, we constant-fold using that value. This is iseq_const_fold. Very simple: it wipes the buffer with what we call a wipe pattern, a series of nops, then fills out the first two words. The first two words are putobject and the constant.

By applying the same technique, we can fold 1 plus 2. As you see, the generated output, putobject, is identical to the case of getting a constant, so we can apply the same thing. This implementation might be a bit small to read, but it's opt_plus. It calculates the value as usual and folds itself with that value.
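Here is a toy, length-preserving constant-folding pass in Ruby (a sketch with invented instruction names, not CRuby's C code): when two putobject instructions feed an addition, the three-instruction window becomes one putobject of the result plus nops, so the sequence keeps its original length.

```ruby
# Toy constant folder: fold [putobject a, putobject b, opt_plus]
# into [putobject a+b, nop, nop], never changing the length.
def const_fold(insns)
  out = insns.map(&:dup)
  (0..out.length - 3).each do |i|
    a, b, op = out[i], out[i + 1], out[i + 2]
    next unless a[0] == :putobject && b[0] == :putobject && op[0] == :opt_plus
    out[i]     = [:putobject, a[1] + b[1]]  # the folded result
    out[i + 1] = [:nop]                     # fill the freed slots
    out[i + 2] = [:nop]
  end
  out
end

seq    = [[:putobject, 1], [:putobject, 2], [:opt_plus], [:leave]]
folded = const_fold(seq)
# folded == [[:putobject, 3], [:nop], [:nop], [:leave]]
```

Deoptimization simply writes the saved original over this folded buffer if plus is ever redefined.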
Not shown in the diff, but all four basic arithmetic operations behave the same way.

Now, next: send elimination. In this example, we call a method m on the receiver self, but we immediately discard its return value with the pop just after it. It's a waste, so we squash them into a series of nops. That optimization, however, is not always possible. It depends on how a method is called and how that method behaves. So, for this optimization, let us call a method pure if its call is safe to be eliminated.

The line is drawn like this. If it writes into a non-local variable, like a global one, it's not pure. When it calls a block: the method itself might not have any side effects, but the block can have side effects, so calling a block is NG. Also, when a method is written in C: in this case, that C method can in fact be pure, but we have no way to detect it, so we can't say anything; we err on the side of caution. Lastly, when a method is calling another method, the entire call graph of the method must be pure to say that this specific method is also pure.

Let's have some examples of methods that are not pure. The upper left one, the first one, is accessing an instance variable, so it's NG. The upper right one is calling Time.now, which is written in C, so it's also NG. The middle left one is yielding; it calls a block, or at least depends on the block, so it's NG. And the hash with a default: you might think it's okay, but it's actually calling a hidden C method inside. This is some implementation detail, but we do so, so it's NG. And lastly, of course, rb_define_method, which is, of course, written in C, is NG.

So, given the so many kinds of methods that cannot be eliminated, is there actually any method that is pure? Yes, there is. For instance, these two. They were, well, selected to be non-minimal, non-trivial examples, so there might be other methods that can be pure. The left one is the infamous left-pad algorithm, which pads a string to a given length.
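Reconstructions of the two slide examples might look like this (hypothetical — the talk's actual code is not reproduced here): a left-pad, and the Leibniz formula for pi that is described next. Both touch only local variables and have no side effects.

```ruby
# Two side-effect-free methods written in pure Ruby, of the kind
# the engine could classify as pure. Names and bodies are my own
# reconstruction, not the slides' exact code.
def left_pad(str, len, pad = " ")
  padded = str
  padded = pad + padded while padded.length < len
  padded
end

def leibniz_pi(terms)
  sum  = 0.0
  sign = 1.0
  k    = 0
  while k < terms
    sum += sign / (2 * k + 1)  # 1 - 1/3 + 1/5 - 1/7 + ...
    sign = -sign
    k += 1
  end
  4 * sum
end

left_pad("5", 3, "0")  # => "005"
leibniz_pi(100_000)    # converges slowly toward Math::PI
```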
And the right one seems to be some kind of numeric algorithm; it is in fact called the Leibniz formula, and it returns the value of pi. These methods are written in pure Ruby and have no side effects, so they are pure.

Note, however, that saying a method is either pure or not is too simplistic, because there are situations where we can't tell. Suppose, for instance, that we call a method and that results in a method_missing. In that case, we can't make sure whether this is pure or not until we actually invoke the method. So, in short, we have to detect a method's purity on the fly. At the beginning, everything is marked as "not predicted yet", say, and as the evaluation progresses, parts of the method are detected to be pure or not. And finally, when everything is fixed, the purity of the entire method is settled and then propagated to its callers.

All right. Let's say we know a method is in fact pure. Still, that alone isn't enough. We have to check how the method is called. That is, if the method's return value is used, we can't eliminate the call. So we focus on the return value. If a send is immediately followed by a pop instruction, which means the return value is immediately discarded, then it's okay; otherwise, the return value is used somehow, so we can't.
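That rule can be sketched as a toy pass (invented names, not CRuby's C implementation): a send of a method assumed pure, immediately followed by pop, is wiped out without changing the sequence length; since the receiver and arguments were already pushed, an adjuststack pops them off in one go.

```ruby
# Toy send elimination: send-of-pure-method + pop => adjuststack + nop.
PURE_METHODS = [:pure_method]  # assume this method was proven pure

def eliminate_pure_sends(insns)
  out = insns.map(&:dup)
  (0..out.length - 2).each do |i|
    op, nxt = out[i], out[i + 1]
    next unless op[0] == :send && PURE_METHODS.include?(op[1]) && nxt[0] == :pop
    argc = op[2]
    out[i]     = [:adjuststack, argc + 1]  # drop receiver + arguments
    out[i + 1] = [:nop]                    # the pop of the result goes away too
  end
  out
end

seq = [[:putself], [:putobject, 10], [:send, :pure_method, 1], [:pop], [:leave]]
eliminate_pure_sends(seq)
# => [[:putself], [:putobject, 10], [:adjuststack, 2], [:nop], [:leave]]
```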
This is where we eliminate the send instruction. It happens inside of adjuststack, not in send. The diff is mostly comments; the important part is vm_eliminate_insn, which is this. Inside, it is very similar to the constant folding: it first wipes the sequence with the pattern, which is nop, nop, nop; then, if the argument count argc is not zero, it fills the first two words.

The argc maneuver is somewhat complicated. For instance, in this example, we call a method m with an argument. If a method takes arguments, those arguments might have their own side effects. So, even when we can eliminate a method call, its arguments must be retained, remain untouched. Therefore we have to fix up the stack state: the argument is pushed on the stack, the send is eliminated, and we have to pop that value.

Next, variable assignments. In this case, we are eliminating assignments to local variables; the setlocal instruction is eliminated in this example. This implementation is not that simple compared to the other ones, something like 279 lines of code, so it's a bit difficult to show you, but let me tell you briefly what is going on. It's not easy to tell strictly whether a variable is not used at all. That would be called liveness analysis, and it's very heavy. Because we have to optimize on the fly, we need to do something lighter. So, this time we check whether a variable is write-only or not. Also, because there are bindings, methods that are not pure might obtain a binding and touch local variables from outside. So we need to restrict local variable elimination to pure methods.
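A toy version of that write-only check (a sketch with invented instruction names, nothing like the real 279-line C implementation): a local is a candidate for elimination only if it is assigned somewhere but never read anywhere in the sequence.

```ruby
# Toy write-only analysis over a flat instruction list:
# a local written by setlocal but never read by getlocal is eliminable.
def write_only_locals(insns)
  written = insns.select { |op| op[0] == :setlocal }.map { |op| op[1] }
  read    = insns.select { |op| op[0] == :getlocal }.map { |op| op[1] }
  written.uniq - read
end

seq = [
  [:putobject, 1], [:setlocal, :a],  # a is written ...
  [:putobject, 2], [:setlocal, :b],  # b is written ...
  [:getlocal, :b], [:pop],           # ... but b is also read
  [:leave]
]
write_only_locals(seq)  # => [:a]  -- only a is safe to eliminate
```

The real analysis additionally has to descend into nested blocks, since they share locals with the enclosing scope.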
Lastly, local variables are shared across blocks, so blocks must be checked too; and because blocks can be nested, we have to check recursively.

Okay, let's summarize what was shown. I implemented several optimizations. For those people who majored in compilers, it is obvious that what I did were very fundamental, very simple ones; no complicated new things. These optimizations run on the fly and preserve VM states. I have not mentioned exceptions at all, because they have nothing to do with this; it doesn't interact with exceptions.

Okay, now the benchmark. We tested what is proposed against 2.4, using Ruby's standard benchmark library. The conditions are shown; I won't read them out, but this setup, I believe, is fairly normal. No new hardware, no new test suite. This is what you can do at home.

This is the entire test result on one slide. A speedup ratio of 1.0 means the speed is even. A greater number shows our strategy beating the trunk, and a smaller number means slower. From what we see here, almost all benchmarks are actually slightly slower than the original. Among them, a few benchmarks achieve extremely fast results.

Let's look in detail. The results shown are those benchmarks that got faster. This is execution time, not the speedup ratio. If you look at it, you find there are several sets of benchmarks that speed up to similar execution times. This is because they are optimized into identical instruction sequences. The benchmarks were meant to measure different things, but the optimization converted them into identical ones, hence the same results.

At the same time, there are cases where we slowed down. One of the interesting benchmarks is this vm2 "evil" case, which has tons of what I called evil activities. This example shows the overhead of the deoptimization. It is marginal, I believe: the deoptimization could in principle slow things down badly, but this one only has a few percent overhead. This is, I think, acceptable.

Other things slowed down. Notably, the block-related ones are slow.
That is because we have to recursively scan blocks in order to detect local variable usage. However, variable elimination is in fact powerful, because not only the assignment but also the entire allocation of the object can be skipped. It has a big impact when it works. The fastest examples in the benchmark here are those where variable elimination took effect. For instance, the vm1_gc_short one got faster not because we touched the GC, but because we eliminated the allocation.

Again, this is the entire benchmark result. You can see that, in general, it is very slightly slow when it slows down, and drastically fast when it speeds up.

So, let's have the conclusion. I implemented what we call the deoptimization engine on CRuby version 2.4. Under some benchmarks, I showed you that it boosts execution up to 400 times, depending on the benchmark. Also, because it is the first attempt of this kind, it leaves lots of room for future optimizations.

Let's talk about future optimizations briefly. Common subexpression elimination should be possible, because it reduces the size of the sequence, so it fits perfectly into our strategy; it can be done right now. Stricter liveness analysis and escape analysis have more overhead, but might work well, so they are also subjects to pursue. Also, if you choose to allow modification of VM states such as exception tables and such, then there will also be other room for optimizations.

That's all. Thank you. I'd like to take some questions, but you might be interested in whether this is actually fast in your application. So, Rails-wise: I tried running Rails on this very laptop, and I got the result. It seems there is only overhead, and no optimization was in effect. But the good news is that there seems to be no memory overhead: just identical memory consumption. So for Rails the optimization, well, doesn't work; but even where it doesn't work, it has no overhead. So we still have lots of things to do to make Rails faster. This is the current
situation. Right? Thank you.