Can you hear me? So my name is Volker Simonis. I'm working for SAP in the SAP JVM team. Probably even fewer people know that SAP has its own Java VM as well, just like IBM. But we are also working in the OpenJDK. So we have contributed the PowerPC and S390 ports to the OpenJDK, and I'm the project lead of these two ports in the OpenJDK. And today I will tell you a story about some optimization techniques in HotSpot and how they can lead to surprising results if you don't get the optimization right. So these are just some corner cases. But anyway, they are errors. And I will show you how we fixed them.
So this is the outline of the talk. I will tell you some words about escape analysis, which is a common optimization technique in virtual machines. And then some words about intrinsics, which is another optimization technique. Then I will briefly introduce System.arraycopy, which probably all of you know, and its specification. And I will show how too much of the aforementioned optimizations can break the specification requirements for arraycopy. And then I will show you how we fixed the problem.
So what about escape analysis? Escape analysis is switched on by default in HotSpot. So if you run your Java program, you will get escape analysis. It detects the escape state of local variables. So it can detect whether they escape your function: whether they arg-escape, which means they just get passed to functions you call, or whether they globally escape, for example if you assign these objects to static or globally visible objects. And the results of the escape analysis are used, for example, to optimize locks and compares, or to eliminate allocations altogether, which is also called scalar replacement. The escape analysis is, of course, heavily dependent on inlining. So the more you can inline into your method, the more effective escape analysis is. But it can also analyze methods which aren't inlined, with the so-called bytecode analyzer.
But again, that's restricted to a certain number of bytecodes and a certain call depth. Escape analysis in HotSpot is based on this paper, which I show here, "Escape Analysis for Java". You can download it from the internet if you're interested in the details. And it's implemented in these two files, opto/escape.cpp and ci/bcEscapeAnalyzer.cpp, in the HotSpot source code if you're interested in the implementation details.
So I will show a short example of how this works. So take this program. We have a method called scalarReplace, which takes an int argument. It then allocates a new integer array which contains this int value three times. And we call a dot function, which computes the dot product of these two vectors.
So if you run this with the Java VM and we only want to compile our scalarReplace function, so we don't want to have any inlining, we get something like this. So this is the assembler code produced by the C2 just-in-time compiler. So actually, it does some argument shuffling. It loads three into argument register one and the class information, so integer array, into argument register two. And then it calls new Java array. So this is the allocation of the array. It then fills the array with our first argument. So remember, rbp holds the first argument. That was the x value with which we called the scalarReplace function. And then it loads the array again into the argument one and two registers and calls the dot function. So because we didn't allow inlining, there's nothing more the JIT compiler can do.
So now we allow compilation of scalarReplace and dot and disable escape analysis. We have to do this because by default it's on, and I want to show you how this looks when it's switched off. Again, the first part is exactly the same. We allocate a new array. We fill the array with the value. And then because we have inlined the dot function, all the computation is done in place. So we multiply the input value.
And then we add it three times to rax. And rax incidentally is also the return register. So that's basically it. So of course, with inlining the method gets faster because we saved one method call.
So now we enable escape analysis and eliminate allocations, which is, as I told you before, actually enabled by default. I just put these options here to emphasize that fact. And the resulting code looks like this. So this, of course, is again much better. It's just the computations. There is no more allocation and no more calls. Because escape analysis has detected that our array, I just go back to the source code, so this array is not escaping the scalarReplace function. We just need it here locally to do the computation of the dot product. And then we only return the result of that computation. So actually, this array isn't really needed. All we need is the array values. So the C2 compiler can detect that by escape analysis. And it eliminates the allocation. And we remain just with the computation.
So if you use a debug build of HotSpot, you can use the PrintEscapeAnalysis and PrintEliminateAllocations flags to get more details about how escape analysis works. And if you run that, you get this information. Escape analysis builds a so-called connection graph. And the interesting thing is this one. So the compiler detected that the new instruction in your Java code was actually transformed into an AllocateArray node in the JIT compiler graph. And the escape analyzer detected that this array is not escaping the function. And then it detects that it can be scalar replaced. And then it says eliminated AllocateArray. OK, so that's how escape analysis works briefly.
So now let's come to intrinsics. So intrinsics are a very old means of optimizing functions. They are available in basically every programming language.
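For reference, the scalarReplace example from the escape analysis part of the talk might look roughly like the sketch below. This is a reconstruction from the spoken description, not the code from the slides: the class name, the body of dot, and the assumption that dot is called with the same array twice are all guesses.

```java
// Hypothetical reconstruction of the escape analysis example; not the original slide code.
class ScalarReplaceExample {

    // Dot product of two int vectors of equal length.
    static int dot(int[] a, int[] b) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // The array never leaves this method (apart from the call to dot, which can be
    // inlined), so C2 may replace it by its three scalar values.
    static int scalarReplace(int x) {
        int[] a = new int[] { x, x, x };
        return dot(a, a);
    }

    public static void main(String[] args) {
        long total = 0;
        // Assumption: enough iterations for the method to be compiled by C2.
        for (int i = 0; i < 100_000; i++) {
            total += scalarReplace(i % 10);
        }
        System.out.println(total);
    }
}
```

With dot inlined and the allocation eliminated, what remains is the arithmetic 3 * x * x, which matches the "multiply the input value and add it three times" description of the generated code.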
Intrinsics can be used for optimization, but also for implementing features which are not directly accessible from your language. So for example, in C, typically the memcpy functions are intrinsified. Or for example, new CPU architectures support transactional memory, but there is no notion of transactional memory in C. Nevertheless, you can use these features through so-called compiler intrinsics. And they are also available in Java, of course. So there are actually quite a lot of HotSpot intrinsics, more than 260 when I counted last time. And there are different levels of intrinsics. If you're interested in this topic, there is another talk I gave at the Joker conference, and you can find the recording on YouTube. I don't want to go into the details here. I will just give you some examples of how it works.
So, for example, look at the System.arraycopy function, which is java.lang.System.arraycopy. If you look at the source code in System.java, you will see that arraycopy is defined as follows. So this is the arraycopy function you may or may not know, and it has this annotation, which is new in Java 9, @HotSpotIntrinsicCandidate. So this means that HotSpot may intrinsify this method. By the way, it's native. It has been native, I think, since the very beginning of Java. I just checked Java 1.1. It was the oldest version I could download. And it was native from the beginning, I think. Probably because in the very beginning, the developers thought that it was better to do this in native code. I think today it could just as well be done in Java because the JIT compilers got much better. But nevertheless it's still maybe interesting to intrinsify this, and we will see why. And the implementation of this native function is a usual JNI function. It takes the JNI environment and it takes the objects as jobject, jobject. And of course, calling a JNI function has a certain overhead.
Now, this is the implementation. It just does some basic checks whether the arguments are null and throws a NullPointerException, and otherwise calls the copy implementation of one of the classes.
So let's see an example. Let's say we have this arrayCopy function, which we've defined ourselves. It takes an integer array as source and another integer array as destination. And then we just call System.arraycopy from source to destination. From the source array, we start at index zero and copy to the destination array at index zero, and we copy eight elements. So you can disable intrinsics in HotSpot with these -XX options. Again, they are on by default. So if I want to show you the difference, I have to switch them off. We will see the machine code which gets generated. And again, there is just some argument shuffling here because our function gets the arrays in different places than are needed later on for calling System.arraycopy. So we just shuffle the arguments around, enter zero and eight for the specific arguments of arraycopy, and then we do a call, a static call to java.lang.System.arraycopy. And this will end up in a JNI call later on.
So now let's see what happens when we use intrinsics. So the generated code looks a little bit more complicated, but actually it's much better because you can see we have no call here. We do a lot of checks here. We load the length. So rsi is the register which holds the first argument. So it's the source array, and at offset 16 of the source array, there is the length. That's the place where the length of the array is stored. We load that into r10 and we compare r10 against eight. So we want to see whether the source array contains at least eight elements. We have to check that because otherwise, the copy wouldn't work. And System.arraycopy is required to throw an exception in that case. We will see the specification of System.arraycopy in a second.
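The arrayCopy example just described could be sketched as follows. Only the call to System.arraycopy with positions 0 and 0 and a length of 8 comes from the talk; the class name, the array sizes and the warm-up loop in main are assumptions added to make the sketch runnable.

```java
import java.util.Arrays;

// Hypothetical reconstruction of the intrinsics example; not the original slide code.
class ArrayCopyExample {

    // Copies the first eight elements of src to the start of dst.
    // Throws an IndexOutOfBoundsException if either array is shorter than eight elements.
    static void arrayCopy(int[] src, int[] dst) {
        System.arraycopy(src, 0, dst, 0, 8);
    }

    public static void main(String[] args) {
        int[] src = new int[16];
        int[] dst = new int[16];
        for (int i = 0; i < src.length; i++) {
            src[i] = i + 1;
        }
        // Assumption: enough iterations for the JIT compilers to pick up arrayCopy.
        for (int i = 0; i < 100_000; i++) {
            arrayCopy(src, dst);
        }
        System.out.println(Arrays.toString(dst));
    }
}
```

Because the length is the constant 8, the JIT compiler knows exactly how many elements are copied, which is what allows the fully unrolled, call-free copy in the intrinsified machine code.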
So the loading of the length field from the array is also an implicit null check. Again, I've covered this in one of my other talks, which you can see on YouTube if you're interested in how implicit null checks are implemented and work in HotSpot. So if, incidentally, this array should be null, we will get a segmentation fault here. But that's no problem because HotSpot has its own signal handler. It will catch the signal and transform it into the corresponding Java NullPointerException, which is required by arraycopy. So again, we do some of the checks here whether all the arguments are correct. And then, starting here with this line, we just load the first element. The first element is at offset 24 of rsi, loaded into r10. And we save it into rdx, which is the second argument, at the same offset. And we do this eight times until offset 52, which is the element at index seven. So actually, we have inlined the whole array copy into this function. We don't need a call any more. And there could be even more optimizations done. For example, we could use vector instructions or even more fancy stuff. But that's actually how the usage of intrinsics has optimized System.arraycopy.
OK, so now to the System.arraycopy specification. It says something like the following. If any of the following conditions is true, an IndexOutOfBoundsException is thrown. So if the length argument is negative, System.arraycopy has to throw an IndexOutOfBoundsException. So that's very easy to understand.
So now let's look at the following test. We have an arrayCopy4 function which takes the source integer array and the length. And we call System.arraycopy and copy from this source array from index 3 into a newly created integer array with eight elements at index 5. And we copy length elements. If this copy succeeds, we just return false.
If for some reason we get an IndexOutOfBoundsException, for example, if this length is too small or too big, we return true. And now we call our arrayCopy4 function from the main method. We just call it in a loop in order to get it just-in-time compiled. And we call arrayCopy4 with a source array which is 128 integers long, and we copy minus one elements. So obviously, if length is minus one and we look at the specification, if the length argument is negative, this should always return true, right? Because every array copy should just throw an IndexOutOfBoundsException. And if for some reason we should get false, that's obviously an error in System.arraycopy, right? So in this case, we print the error and that's it.
So if you run this with Java, it runs for a while. And then we indeed get an error at iteration 5,376: an IndexOutOfBoundsException was expected, but we didn't get one. So why did this happen? Strange, right? And at this iteration. So maybe you guess that this could be related to escape analysis, because otherwise I wouldn't have told you about it at the beginning of the talk. So now we can use these options, which I showed before, PrintEscapeAnalysis and PrintEliminateAllocations. And indeed, we see that before the error happens, the just-in-time compiler eliminated an allocation. And when we go back to the example, we see that here in our arrayCopy4 function, we allocate this integer array, but this array doesn't escape. So actually, the JIT compiler can eliminate it, right? Because we don't need it.
And this also explains a bit this strange iteration number, because HotSpot has something called tiered compilation. So first it runs in interpreted mode. Then after a method gets hot, it gets compiled with the tier one, so-called C1 compiler. And after that method gets hot again, it gets compiled with C2. And escape analysis is only supported in the C2 server compiler.
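The test described above can be sketched roughly as follows. The positions 3 and 5, the eight-element destination, the 128-element source and the length of -1 come from the talk. The class name, the method name arrayCopy4 (a guess at what was said), the iteration count and the message text are assumptions.

```java
// Hypothetical reconstruction of the reproducer; not the original slide code.
class ArrayCopyNegativeLength {

    // Returns true if System.arraycopy threw an IndexOutOfBoundsException,
    // false if the copy completed without an exception.
    static boolean arrayCopy4(int[] src, int length) {
        try {
            // The destination array is allocated here and never escapes this method.
            System.arraycopy(src, 3, new int[8], 5, length);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int[] src = new int[128];
        // Assumption: enough iterations to reach C2 compilation under tiered compilation.
        for (int i = 0; i < 20_000; i++) {
            if (!arrayCopy4(src, -1)) {
                System.out.println("ERROR at iteration " + i
                        + ": IndexOutOfBoundsException expected");
                return;
            }
        }
        System.out.println("No error detected");
    }
}
```

On a JVM that follows the specification, arrayCopy4(src, -1) returns true in every iteration, whichever tier executes it.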
So that's why we get this error only after about 5,000 iterations, because obviously the interpreter and the C1 compiler run our program correctly and always throw the IndexOutOfBoundsException.
So we could now try to disable the arraycopy intrinsic and run the example again. And wonder, wonder, it works. And if you disable EliminateAllocations, again, the example will succeed. So the error must be related to escape analysis and intrinsics.
So let's see what code gets generated. Again, we see that we have some checks here, but the generated code doesn't contain any check for length being negative. It checks whether the source length is smaller than source position plus length, and it checks that the source argument isn't null and all kinds of stuff, but it doesn't check for length being smaller than zero. So that's obviously an error.
So this is the most complicated slide of this talk. This is how the compiler works in HotSpot. So the C2 compiler starts to parse the bytecode of your method, and then it parses the blocks and it parses the calls. And when it sees the call to System.arraycopy, well, for every function it checks if it can be intrinsified. So System.arraycopy, as we saw, can be intrinsified. It tries to inline it, and it inlines arraycopy. And instead of creating a call node, it creates an ArrayCopy node, which will later on be replaced directly by machine instructions. And at a later stage, the so-called optimization stage, we run the escape analyzer, and the escape analyzer recognizes that it's possible to eliminate an allocation. It eliminates the allocation node, and even later on, it processes the users of this allocation node. Because obviously, when we eliminate allocations, all the users of the allocation can be eliminated as well. And then in the final stage, we generate code for all the nodes.
So generate_arraycopy will replace the ArrayCopy node by machine code. So here is inline_arraycopy. This is where the call to arraycopy is replaced by the ArrayCopy node. This is from the actual HotSpot source code. It has very nice documentation, which says the following tests must be performed. Because obviously, when you replace the call, we still have to adhere to the specification. And point 6 is: the length must not be negative. But then later on in the code, we generate all the checks, but 6 is missing. Strange, right?
So it's actually not as bad as you may think. Because here in generate_arraycopy, that's where the code is actually generated, we have 6. So length must not be negative. We generate the code here. But when we ran escape analysis before and processed the users of the eliminated allocation, we just eliminated the whole ArrayCopy node. Because it's not used, right? It just wants to copy data to some memory which is not used. So the whole ArrayCopy node is eliminated, and we don't even generate code for it. So we don't have the check anymore. So that's actually the error. So just to repeat: here we create the ArrayCopy node, here we eliminate it, and here we wanted to create the last check, which was missing here.
So obviously, intrinsification is a very old optimization technique, and it has been in HotSpot for a long time. Later on, escape analysis was added. And it just didn't take this combination into consideration. And actually, this generate_arraycopy code is also used for other things, for Object.clone and things like that. So that's probably why at one point in time it was a benefit to move this check to this later stage.
So the fix is actually easy. Maybe we would just add it here. But then it may happen that we have the check two times. We have it one time here. And if we don't eliminate the ArrayCopy node, we have the same check a second time here. So we did it a little bit smarter.
We extended the ArrayCopy node and added a flag to say whether we already added this check for length or not. And when we generate code, we just ask whether the check was already generated or not. And we only generate it if needed. So this was that.
I think I'm running out of time. The funny thing I just want to mention is that when I was fixing this bug, I wrote, like every good programmer, some regression tests. And incidentally, I found another bug in the C1 compiler for this part of the specification here. So if the source argument refers to an array of primitive component type and the dest argument refers to an array of reference component type, we should throw an ArrayStoreException, right? So I will just show this very quickly. This is a variation of the array copy test. Here we don't take an int array as source, but just an Object. And we again copy from source to an object array, with length. And when we call this with an int array, obviously it's not possible to copy integers to objects. And usually this check works. But if you call this with zero length, there is a shortcut in the generation of this array copy node, which says, OK, if the length is zero, I don't have to do all these fancy checks anymore.
So with the same example, if you run it, we already get the error about the missing ArrayStoreException in iteration 256, because this is already in C1-generated code, which happens much earlier. If you run it in interpreted mode, no error happens. If you switch off tiered compilation, so that only the C2 compiler runs, again, we get no error. So this way you can find out if errors are related to the interpreter, the C1 compiler, or the C2 compiler.
So this here is the code where this check is generated. So check if negative. We do a check for length here. If the length is less than zero, we jump to an error stub. Otherwise, if the length is zero, we just go to the continuation, which is at the end, before we even call the array copy routine. And this check here, the check for the element types, is just shortcut away.
And that's where the error happens. So this was the bug, and the fix went in in November. And all the slides and code are on GitHub, if you want to have a look at them. So thanks a lot.
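The second test, for the C1 problem with zero-length copies, can be sketched in the same style as the first one. The Object-typed source, the int array passed in and the length of zero come from the talk. The class and method names, the positions, the size of the destination array and the iteration count are assumptions.

```java
// Hypothetical reconstruction of the second reproducer; not the original slide code.
class ArrayCopyZeroLength {

    // Returns true if System.arraycopy threw an ArrayStoreException,
    // false if the copy completed without one.
    static boolean arrayCopyObject(Object src, int length) {
        try {
            System.arraycopy(src, 0, new Object[8], 0, length);
            return false;
        } catch (ArrayStoreException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Object src = new int[8];
        // Assumption: enough iterations to reach C1 and later C2 compilation.
        for (int i = 0; i < 20_000; i++) {
            // A primitive source and a reference destination must be rejected,
            // even when zero elements are copied.
            if (!arrayCopyObject(src, 0)) {
                System.out.println("ERROR at iteration " + i
                        + ": ArrayStoreException expected");
                return;
            }
        }
        System.out.println("No error detected");
    }
}
```

The specification ties the ArrayStoreException to the component types of the two arrays and not to the length, so a zero-length copy from an int array into an Object array still has to throw.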