So hello everyone, and welcome to my talk about the just-in-time compilers in the OpenJDK. I'm Martin Doerr, and I work for SAP. We are a small team working on the JVM at SAP in Walldorf, and this time we have three talks in a row, so we will also see two of my colleagues later on. But let's get started with the agenda. I will talk about the just-in-time compilers, which translate Java bytecode into machine code in the OpenJDK. I will also talk about how the different compilers work together, and I'd like to address resource usage as well. First of all, how many compilers do we have in the OpenJDK? We've already heard about two of them, but we have one more: three in total. The one we haven't talked about today yet is the client compiler, which is also called C1. It compiles pretty quickly, but with a lower optimization level. Then we have the server compiler, also called C2; we already had a talk about that one. It's kind of the opposite: it compiles slowly, but with a high optimization level. For example, it has a lot of loop optimizations, and we've also heard about escape analysis, and there are still people working on improvements for that. Both compilers are available on a lot of platforms, including PowerPC and s390x, which are supported by our team, and these two compilers are used by default. We've also heard about Graal, which is rather new in the OpenJDK. It is still experimental in the OpenJDK; that means it is not used by default, and you need to switch it on if you want to use it. It is developed on GitHub, and updates get merged into the OpenJDK. What's special about Graal is that it is written in Java; that is a big difference to the other two compilers. It also does a lot of optimizations; it has a more sophisticated escape analysis, for example. Andrew has already shown a few things about Graal.
So thanks for that. And it is optimized for dynamic languages. By the way, the Graal compiler is also called the GraalVM compiler. I'd like to show a few things Andrew has already mentioned. I got one slide from Oracle, so thanks to Oracle for providing it. There are three different use cases of GraalVM, and the Graal compiler is always in the center of it, together with the JVMCI, the Java Virtual Machine Compiler Interface. The use case on the very left is the one which is available in the OpenJDK: you have a Java application, or Java methods, which get compiled by Graal, and they run on the HotSpot VM, the Java virtual machine of the OpenJDK. So that path on the very left is supported by the OpenJDK. In addition to that, on the right-hand side, you can see the native image technology Andrew also already mentioned, where everything gets pre-compiled. And there's something in between, where only the Graal compiler is pre-compiled into a shared library. So that's the basic difference between this approach and that one: in the middle one, the Graal compiler itself is pre-compiled into a shared library. So, back to the different compilers; I'd like to compare performance a little bit. By the way, this is an old benchmark with an old JDK and an old garbage collector, so don't care too much about the absolute numbers.
I think it's good to get a first impression of the performance of the different compilers. At the bottom you can see the interpreter, for reference. That means we are not using any just-in-time compiler at all, and you can get that by specifying the runtime option -Xint, which stands for interpreter mode: you will only use the interpreter and no JIT compilers at all. As you can see, the performance is pretty poor. Already much faster is C1, the client compiler. You can select that, for example, by using the flag -XX:TieredStopAtLevel=3. That might sound a little bit complicated, and I have to note that level 3 means that C1 still performs profiling, so you won't get the best performance out of C1 that way. If you wanted to tune for C1 peak performance, you would select -XX:TieredStopAtLevel=1, and then you would get C1 without profiling. But in this case I wanted the better profiling information; that's why I left it on. If you want to use the C2 compiler only, you can switch off tiered compilation, and then you get the blue line, which is already much faster. The default configuration uses tiered compilation, and you get the fastest startup and the best peak performance, the best peak performance also because the profiling information is better. I'll explain the tiered compilation stuff in more detail later on, so you should be able to understand that better at the end. But for those who hate this old stuff, I also have a slide with the latest JDK: the same old benchmark with the latest JDK 15. You can already see that the peak performance is better, with C2 especially, and you can also see the green line, which is new: that is Graal. In order to use Graal, you need to use the switch -XX:+UseJVMCICompiler, which is an experimental option.
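The compiler-selection flags mentioned so far can be summarized as command lines. This is a sketch; `MyApp` is a hypothetical main class standing in for the benchmark:

```shell
# Interpreter only, no JIT compilation at all
java -Xint MyApp

# C1 only, without profiling (best C1 peak performance)
java -XX:TieredStopAtLevel=1 MyApp

# C1 only, stopping at level 3 (C1 with full profiling)
java -XX:TieredStopAtLevel=3 MyApp

# C2 only: switch off tiered compilation
java -XX:-TieredCompilation MyApp

# Default configuration: tiered compilation (interpreter -> C1 -> C2)
java MyApp

# Graal as the JIT compiler (experimental, needs to be unlocked)
java -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler MyApp
```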
So you need to unlock it in addition, and Graal is the default JVMCI compiler, so we will get Graal by this flag. The peak performance of Graal is good, even for this very traditional workload. Graal's performance is of course better for more modern workloads, for example if you run Scala; that's what Twitter does a lot. I should also mention that the OpenJDK only contains the Community Edition of Graal. There's also an Enterprise Edition available, which contains more optimizations, so you will get better performance with the Enterprise Edition. You can also see that the startup takes longer: it takes a couple of seconds, roughly 4.5 seconds here, to get to peak performance. That's due to the fact that Graal itself is written in Java: the Graal compiler itself gets interpreted at the beginning, then later on hot methods get compiled by C1, and later on they get compiled by, which compiler? The Graal compiler itself. So Graal compiles itself, and that takes a few seconds. This may be okay for large server applications where you can afford spending a few seconds, but there's also a possibility to fix that if you need a quicker startup, and that's available with GraalVM. The JVM has a flag called -XX:+PrintFlagsFinal, and if you enable that, you will see all the flag values the VM sets for itself. There you can also find UseJVMCINativeLibrary. With GraalVM, that one is true by default, and it means the JVM is using the pre-compiled shared library. So the Graal compiler is already pre-compiled, and you get a pretty good startup with that. Next, I've promised to explain tiered compilation a little bit. Tiered compilation is basically the answer to the question of how these different compilers work together. As already mentioned at the beginning, everything starts at the interpreter, which is tier 0, and then we have three different tiers for the C1 compiler. Tier 1 is C1
without any profiling; that is used only for trivial methods, when C1 believes that it's not worth optimizing further, so we stick with this trivial compilation. Then there's tier 2, where C1 uses reduced profiling; it does that when it thinks there's too much work to do, so we should just make it quick. The default tier for C1 is tier 3, where you get the full profiling code compiled into the compiled method. Finally, tier 4 is for the highest optimization level, and it uses the C2 compiler by default in the OpenJDK; you can replace that with Graal if you enable it explicitly. You can also see the tiers when you enable -XX:+PrintCompilation: you can see which method gets compiled at which tier. Typically most methods get started at tier 3, and then you also get tier 4 methods, compiled by C2 in this case. Here's also a picture to explain that a little more in detail. Everything starts in the interpreter, as already mentioned, and the interpreter performs invocation counting. Once the invocation count of a method reaches a certain level, a compile task gets generated in the C1 compile queue; a C1 compiler thread can pick it up and create a C1-compiled method, which is a tier 3 method in this example. As already mentioned, tier 3 also does profiling, which includes invocation counting, so this compiled code still does invocation counting, and once a method reaches the next threshold, a compile task gets generated in the C2 compile queue. Similar to C1, a C2 compiler thread can pick it up and create the fastest version of the method. This is how it works for method invocations, but there may be long-running loops without any method invocations, and obviously the invocation counting will not help in that case. That's why there's also back-edge counting, which works similarly.
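The back-edge counting just described can be observed with a tiny program. This is a sketch under the assumption that the loop runs long enough to cross the back-edge threshold on your JVM; with -XX:+PrintCompilation, OSR compilations are marked with a `%` sign in the output:

```java
// Sketch: a long-running loop inside a single method invocation.
// The interpreter counts the loop's back edges; once the threshold is
// reached, an OSR ("on stack replacement") compilation is requested and
// the interpreted frame is swapped for a compiled one mid-loop.
// Run with: java -XX:+PrintCompilation OsrDemo
// (OSR compilations show up with a '%' marker in the output.)
public class OsrDemo {
    public static void main(String[] args) {
        long sum = 0;
        // One invocation, many back edges: invocation counters won't
        // trigger a compile here, but back-edge counters will.
        for (int i = 0; i < 100_000; i++) {
            sum += i;
        }
        System.out.println(sum); // 4999950000
    }
}
```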
So it's almost the same slide, but here with back-edge counters instead of the invocation counters, with different limits. What happens here is that the compilers generate so-called OSR methods, which stands for on-stack replacement. They are special methods: they have an entry point for the loop. On-stack replacement is called this way because an interpreter stack frame gets removed from the stack and replaced by a compiled stack frame; that's why we call it on-stack replacement. So, I've already talked about compiler threads. How many compiler threads are we using? Well, that depends on the machine we are running on. In the office I have a 40-CPU Linux machine, and when using -XX:+PrintFlagsFinal I can see that the VM selects CICompilerCount=15, which is computed by a fancy formula. One third of them are reserved for C1, and the remaining ten in this case are reserved for C2. Similar to the compiler threads, the VM also decides on how many GC threads to use, which is 28 on my machine. Obviously these numbers are pretty high for simple workloads; when you just do trivial things with your JVM, you don't need so many threads. We already heard this morning that threads are expensive, so we usually don't want that. That's why we have implemented a new feature that was contributed by us; it's called dynamic number of compiler threads. We already shipped it with JDK 11, so it's not brand new, but it's the first time it is shown at a conference, I believe. What we do with this new feature is interpret these numbers as maximum numbers.
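On a concrete machine you can inspect the values the VM chooses for itself. A sketch; the exact numbers depend on your CPU count and JDK version:

```shell
# Print all final flag values and pick out the thread counts.
# On the 40-CPU machine from the talk this reports CICompilerCount=15
# and 28 parallel GC threads; your numbers will differ.
java -XX:+PrintFlagsFinal -version | grep -E "CICompilerCount|ParallelGCThreads"
```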
So we start up to 15 compiler threads, and I'll get back to that later, but we start one thread of each type at startup and additional threads only on demand. There's a similar feature called dynamic number of GC threads, which was already implemented by Oracle, and Oracle switched it on by default with JDK 11. With that you get, of course, much lower resource usage, and it's still possible to switch these features off to get the old behavior, where all compiler and GC threads get started at VM startup. I have tuned all the memory settings to very low sizes, so the JVM should actually not use a lot of memory, but you can see the virtual memory is pretty high here. That's because of the threads: they reserve, or occupy, a lot of virtual memory on Linux due to glibc. If you don't switch off these new features, you can see we get a much lower virtual memory usage: it's down from six gigabytes to one and a half. But it's not only about virtual memory; of course we also save other resources. You can also trace compiler thread activity with this flag; it's a diagnostic flag, so you need to unlock diagnostic options to enable it. As already mentioned, you can see that the JVM initially starts one compiler thread of each type, so one C2 and one C1 thread, and they are kept alive for the whole lifetime of the JVM. The other threads only get added on demand; that depends on the compile queue length, and also on the available memory and the code cache space which is available, because we don't want to mess things up when the memory is already full: in that case we don't want to start any further threads. Once these compiler threads don't have any work left to do, they will die after some time, and they die in the reverse order.
That is, the reverse order they were created in, so we don't have any gaps in the compiler thread list. So that's the feature we are already using. One remark on the memory usage of the compilers: the C1 and C2 compilers, of course, use native memory, and in comparison to that, the Graal compiler uses the Java heap. That may be an issue, because your Java application uses the same heap, and you may need to configure a larger heap with the -Xmx flag; otherwise you may get out-of-memory issues. This is also solved by using the shared library, because that uses a separate heap, which is part of the native image technology, so it doesn't use the regular Java heap which you want to use for your application. So that's already all I wanted to tell. Maybe a few remarks: it is also possible to configure the compilers to use less memory. For example, you can tune inlining, but of course that may have performance implications, and it is also possible to set a node limit for the C2 compiler, which will limit the memory it uses. But that always has side effects, so I wouldn't recommend it in general. So I'm sure we have time for questions left. Excellent. Any questions?

Q: We need a microphone... I was just wondering what the compiler thread count and heap size, or virtual memory size, look like when you force tier 1, when you only run with C1. I would assume it's fewer threads and less heap, but you didn't cover that.

A: The virtual memory issue is due to the malloc arenas from glibc, and the first allocation already occupies a 128-megabyte block. It's not really used; it's only virtual memory, so in most cases it's not really a problem. But that is independent of which compiler or which thread it is; it also happens with Java threads or with any other thread. There's also another way to fix that: you can configure glibc to use fewer malloc arenas.
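The glibc tuning just mentioned is set through an environment variable; a sketch (the right value depends on your workload, and `MyApp` is a hypothetical main class):

```shell
# Limit glibc to a single malloc arena before starting the JVM.
# This reduces the virtual memory reserved per native thread, at the
# cost of more contention if many threads call malloc concurrently.
export MALLOC_ARENA_MAX=1
java MyApp
```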
There's a MALLOC_ARENA_MAX environment variable, and you can limit the memory by using that. That may have an impact on other performance aspects, because if you have many native threads which perform a lot of concurrent mallocs, you may get issues with that. But for the JVM itself it works pretty well; we have experimented with using only one malloc arena, and the JVM itself still works quite okay, because it has its own memory management and we are not doing that many small concurrent mallocs. Good question, actually. Thanks.

Q: So I have a question. For the server compiler and the client compiler, the code cache is managed by the sweeper. Is the same mechanism implemented for the Graal VM?

A: The sweeper has a separate thread now, so it's no longer part of the compiler threads, and I'm not aware of any special relationship between the Graal compiler and the sweeper. Maybe Andrew has a few thoughts about that?

Andrew: I'm not absolutely sure about that, so I don't want to say unless I'm sure. But I do know that there's a change that external code segments are not in the original code cache, and they're wrapped with a stub that points to them. So Graal is managing some of its own memory, and I'm not sure how those segments get reclaimed, but Graal does know about deoptimization events, so that may be a way it can find out that something has been released, and there's a release protocol. I just don't know for sure.

A: Okay, thanks. I'm not really a Graal expert; I've worked a lot on C1 and C2, but not so much on Graal. But related to the code cache, there was a significant change: back in the past, the sweeper was run by the compiler threads, and in the meantime we have a dedicated sweeper thread. More questions? So I think we're done. Thanks, everyone, for your attention.