Hi, everyone. This is going to be a compiler talk about the C2 compiler in OpenJDK, specifically a new optimization that we worked on at Microsoft, surprisingly, the so-called stack allocation. In this talk, I'll start with a quick introduction to what currently exists in HotSpot C2, extend that with the work we did in our engineering group, then show some results on popular benchmarks, and hopefully finish off with the stuff we still need to do. This is still very much a work in progress; we haven't submitted the patch to the OpenJDK compiler mailing list yet. Just a warning: this may be boring for some people. There's going to be a bunch of compiler jargon here, and I can't work around that, unfortunately. So, stack allocation. The main motivation was to alleviate some of the GC pressure. It was originally brought to us by Kirk Pepperdine; that's how we started working on this. Some of you may know him. He's big on GC tuning in the Java world, and as we were looking at some Spark workloads to improve them, he would often say that these allocations from Scala gave the GC seizures, and so on. So it's really bad, you guys should try to fix this. How are we actually going to do this? By eliminating the allocation, but not quite: we're going to allocate on the stack frame rather than on the heap. It's a known compiler optimization; it's just not actually being done in C2 yet. So when can we do this? When the object does not escape the current method context, and typically objects escape through returns, through calls to methods, or by being stored into fields and passed around. There are places where we can do it and places where we can't. So here's an example. We have a very simple Java program here with three allocations. We box two integers, and then we finally return another one. Integer is an immutable class, so the addition creates a new object, and that's boxed and returned.
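A small Java sketch of the kind of method on the slide (my reconstruction; the class name, method name, and values are made up). Two boxed Integers are created for the operands and a third for the result, and only the returned one escapes:

```java
public class BoxExample {
    // Reconstruction of the slide's three-allocation example. Integer is
    // immutable, so without optimization each boxing below is a fresh
    // heap object (values are outside the Integer cache range on purpose).
    public static Integer sum(int a, int b) {
        Integer x = a;    // allocation 1: boxes a (does not escape)
        Integer y = b;    // allocation 2: boxes b (does not escape)
        return x + y;     // unbox, add, allocation 3: boxes the result, which escapes via return
    }
}
```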
So the first two objects do not escape the method, but the last one does. And the way that we tell which ones escape and which ones don't is by a compiler pass called escape analysis. So a bit more about escape analysis. Escape analysis was introduced by a paper from IBM T.J. Watson Research a while back, and essentially the paper described two different kinds of the analysis: a flow-sensitive one and a flow-insensitive one. The one that's implemented in C2 is the flow-insensitive version, and it's the right choice, actually. The paper describes both, but it shows that they do similarly well, and the flow-insensitive one is much easier to implement and maintain, and less memory-intensive. Currently it's used in C2 for the following two purposes. The first is monitor elimination: if the objects are proven not to escape, they can only be seen by one thread, so we can eliminate the monitorenter and monitorexit operations on these objects. No synchronization on them. So if you're using a StringBuffer instead of a StringBuilder, this will help. The second one, which was interesting for us, is scalar replacement. This is the optimization that we extended. Scalar replacement goes by a simple concept: you take the original object and break it apart into its individual parts, so you do away with the actual allocation. Breaking up the object turns its fields into normal autos, or local variables, that sit on the stack. Therefore, no heap allocation. So here is another example, slightly different than before; people that know this stuff will maybe notice the subtle difference, but we are doing the three allocations as before. Now let's move on to the next step: scalar replacement will come in and actually turn it into this. This is what the final program would actually behave like.
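Conceptually (a sketch of the effect, not actual C2 output), scalar replacement rewrites the method so the boxed values exist only as primitive locals on the stack:

```java
public class ScalarReplaced {
    // What the method effectively behaves like after scalar replacement:
    // the Integer objects' int value fields become plain locals, and no
    // heap allocation remains for them.
    public static int sum(int a, int b) {
        int x = a;    // former Integer's value field, now a local
        int y = b;
        return x + y;
    }
}
```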
The int field within the Integer object, the primitive value, was extracted and stored as a local variable on the stack. Now, when can we do this? Mainly when we don't need the original form of the object anymore, so when we can prove that the object as a whole Integer is not needed. Now let's see some of the limitations of scalar replacement. When we looked at the code, there are a number of reasons why scalar replacement can fail. But when we did the analysis and ran a bunch of benchmarks and workloads to see what the main cause of scalar replacement failing in HotSpot C2 was, it turned out to be the introduction of control flow. Namely, this being a compiler talk, the phi over here: we have two definitions on the two sides of the control flow. On one side, the object is being instantiated, this new MyClass object here. On the other side, we're pulling one out of an array. Now, coming down to the final `return object.x`: which one do we have in our hands? Therefore, we need the original shape of the object; we need to do a field load on that object. Because one side may be scalar replaced, but the other side will need a full-blown object coming from that array. So how common is this issue? Back to my original example. The side that we have on the left here, we can scalar replace. But that's not what typically happens with autoboxing. The side on the right, C2 is unable to handle, mainly because Integer.valueOf, which is what actually gets called when you autobox a primitive, internally has the exact same pattern I showed in the previous slide. It has a compare against a range, -128 to 127, which is also configurable. If the value falls in this range, you get a pre-cached Integer object from a static array, which is a poor man's version of allocation elimination. And it actually works for that range of values.
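The pattern the speaker describes inside `Integer.valueOf` looks roughly like this (a simplified sketch, not the JDK source; `valueOfSketch` and `CACHE` are my own names):

```java
public class ValueOfSketch {
    // Simplified version of the Integer.valueOf pattern: values in
    // [-128, 127] come from a pre-cached static array, everything else
    // is a fresh heap allocation. The merge of those two paths is
    // exactly the phi that defeats scalar replacement.
    static final Integer[] CACHE = new Integer[256];
    static {
        for (int i = 0; i < 256; i++) CACHE[i] = i - 128; // boxed once, up front
    }

    public static Integer valueOfSketch(int i) {
        if (i >= -128 && i <= 127)
            return CACHE[i + 128]; // cached object pulled out of a static array
        return new Integer(i);     // fresh heap-allocated object
    }
}
```

The identity semantics match the talk: inside the range you get the same object back every time, outside it you get a distinct allocation per call.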
But every time you go above or below that range, you get an actual heap-allocated object. So can we make this work? There are compiler optimizations that could potentially make this little example work, but they have drawbacks. One typical way is cloning: the backend optimizations could say, well, we have a condition, we go either way, so why don't we just specialize the method body for this side and for that side. But say you have another branch; then it becomes exponential, and code size grows insanely. So it's not actually very useful. Another way you could do it is maybe by code motion, but then you're stuck with side effects. Say the object we're pulling out of that array was null. What if you have to throw the NullPointerException? It won't be on the right line. So these kinds of optimizations have limitations on how often we can actually apply them. Now, this is what we came up with. We said, well, we don't actually have to scalar replace it. What if we actually allocate it, but on the stack? The object shape is preserved, as it normally is, and it just lives on the stack like a primitive. There's a little bit of extra stuff: it has a flags field, it has a class pointer, everything you would normally expect from an object. Now, is this useful? Well, yeah, consider this example, with this loop over here. I cleverly returned the primitive data type so my object doesn't escape, but this loop will keep generating new objects every time around. The Integer is immutable; therefore, there's a new object for every addition you do. So, stack allocation: will it work on the previous example? Yes, because on both sides we now actually have a plain old Java object.
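The loop the speaker shows is along these lines (my reconstruction; the names are made up). The accumulator is a boxed Integer, so each iteration's addition allocates a new Integer, yet none of them escape because the method returns the primitive:

```java
public class LoopBoxing {
    public static int sumLoop(int[] values) {
        Integer acc = 0;          // boxed accumulator
        for (int v : values)
            acc = acc + v;        // unbox, add, box: a fresh Integer per iteration
                                  // (a heap allocation once the sum leaves the cache range)
        return acc;               // the primitive is returned, so no Integer escapes
    }
}
```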
So the `object.x` field load, the getfield that happens there, has no problem. On one side, it will read from the stack. On the other side, it will read from the object that came from the static array. So how do we implement this in C2? That brings us to the second part of the presentation. Charlie Gracie and myself, inspired by the words of Kirk, started looking into implementing this with stack allocation. We had to modify escape analysis in C2 to recognize cases where we can safely stack allocate objects. Not all non-escaping objects can be stack allocated; I'll show some of the limitations later on, and there are plenty of them. We implemented the stack allocation path in macro expansion, so we had to write a separate path that stripped out everything else, keeping only the path where we stack allocate the object. So how do we stack allocate? And this was one of the bigger revelations: we actually used BoxLockNode, which was used for monitors, mainly because we needed a way to communicate a stack oop, which was not done in any other way, from the IR back to the code generator, to say, hey, this should be a reference to a point on the stack somewhere. So right now, our stack-allocated objects end up where the lock slots would be, which is right after all the spills and locals, before the preserved registers on the frame. So at that point we didn't have to worry about the frame layout. As soon as we did that, we immediately got an assert in the garbage collector saying, what the hell is this? You're giving me a reference onto the stack, that's not right. So we had to extend GC root scanning to support these objects, because what it will look like is a local on the stack that points to another stack location, which is quite unusual, right? A more subtle issue I'll describe later is detecting live ranges of objects in loops. And the next item is removing the write barriers.
We obviously can't keep them, because you'd do a card mark on a stack location, and that's not good. And then the other two items were already being done; we found similar code for scalar replacement, so we were able to leverage that. We had to similarly implement rematerialization of objects on deoptimization: for any safepoint that the allocation can reach, we had to inject this scalar-replaced-allocation node, I believe it was called, where we describe which fields need to be copied over to a heap object. So here's the GC root scanning. What you see is that below the locals you now have a pure stack-allocated object. The first field over here will be a flags field, then we have a class pointer and some reference fields. So the GC needs to be taught that, as you walk the stack and you have a reference coming in here, you need to find all the oop fields and actually mark them so you don't lose any objects. Now, these overlapping live ranges were kind of a subtle gotcha. To be honest, I once knew this, but I had forgotten about it. I used to work at IBM on the J9 JIT compiler, and we used to do this, but I sort of had to relearn it from scratch. It's an interesting case: if these two objects, as we have them here, are stack allocated, then as soon as we get to the second definition, where value is assigned result, what ends up happening is that these two addresses, which are actually addresses on the stack, become the same. The value coming back around on the second iteration of the loop will be at the address that result was before. But result is stack allocated; it's always the same address. So all of a sudden, after you first enter this if, you never enter it again. Typically, when this was a heap allocation, you got a new address every time. You allocate from the thread-local allocation buffer, or you allocate from the heap somewhere, but it's a new address.
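A sketch of that hazard (my reconstruction of the slide's shape; the class and names are made up). With heap allocation, each `new` yields a fresh address, so the identity check is taken every iteration; if both `result` and the loop's object reused the same stack slot, the check would stop being taken after the first iteration:

```java
public class LiveRangeHazard {
    static class Value { int x; Value(int x) { this.x = x; } }

    public static int hazard(int n) {
        Value result = new Value(0);
        int taken = 0;
        for (int i = 0; i < n; i++) {
            Value v = new Value(i); // on the heap: a fresh address every iteration
            if (v != result)        // identity comparison, i.e. an address compare
                taken++;
            result = v;             // the definition that would alias the two stack slots
        }
        return taken;               // n with heap allocation; would be 1 if both shared one stack slot
    }
}
```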
So your address comparison works because the location where the object is stored is different each time. Once it's on the stack, it's always the same address, which is what we want, since we want to reuse it for better cache behavior, fewer page misses, and to remove the allocation, but then we end up in this problem. So we have to detect this and reject one of the two as a candidate for stack allocation. One of them on the heap and the other one on the stack is fine. On to current limitations. There are a few things we can't do right now. We don't stack allocate objects with monitors. It's kind of a side effect of the BoxLockNode approach; we just didn't finish the work, and it's not hard to do. But some of the monitor elimination code eventually compacts the box lock slots we use for stack-allocated objects, so we'd mess things up. The second one is the main issue that we have with performance right now: we do not allow stack-allocated objects to point to each other. Obviously a heap parent would mean escaping, and that's already handled by escape analysis, but stack-allocated pointing to stack-allocated is not allowed at the moment. There are ways to resolve that, but right now we don't do it. We don't have compressed oops support yet, and this is mainly because you can have a merge point with a heap object on one side and a stack-allocated object on the other. Once we're in compressed oops mode, the reference gets stored compressed. Well, compressing a stack address doesn't work, because you cannot guarantee that the address range will be within that 32-bit space. We don't stack allocate arrays at the moment either. We just ran out of time; there's no particular reason why we didn't do it. Primitive arrays would be simple; reference arrays need special consideration around array copies. We just didn't have time before this presentation. And finally, thank you, Ron Pressler, we may actually need to do something special for Project Loom here.
We may either have to prevent stack allocation of objects that live across method calls, because in the mode where Loom does the fast relocation of the stack, it's just a simple memcpy. So if you have a reference on the stack pointing to a stack-allocated object, nobody's there to patch it. We'll get there. So now some good news, actually. These are the performance improvements we've actually got with the prototype that we have. For a compiler guy like me, this is amazing, because I usually work for three or four months for a 2% or 3% improvement, and seeing a range of applications get significant speed-ups is quite good. Allowing stack-allocated objects to point to other stack-allocated objects would be another massive improvement if we get it right, because there are certain very common patterns in Scala that do have an object graph pointing to each other, which we currently reject. And so, finally, the last bit of the presentation: when and where can you see this patch? Right now, Charlie and I are in the process of migrating our patch from JDK 11. There's no particular reason why we picked JDK 11; we were looking at Spark and sort of continued down that path from the build we were using. We're migrating to the tip and cleaning up the code, and as soon as that's done, we'll post it to the compiler-dev mailing list and ask for review. That's the plan. Our next steps are to stabilize the prototype and clean it up. We have a lot of crashes; there are methods in benchmarks we couldn't run because of issues. Then we'll start removing the limitations one by one; stack-allocated pointing to stack-allocated would probably be my first pick. Right now we only support G1 and Parallel GC, with a marking extension to walk stack-allocated objects, so we need to extend that and see how it works with GC models like Shenandoah or ZGC.
And finally, look for more opportunities in other real-world applications. I want to see if we can actually improve the various REST frameworks that are out there that people build with, and maybe Elasticsearch, products like that. Which leads me to the end of the presentation: if you like what you saw here, please stay in touch. We'd like to work with everyone here. Both Charlie and I are really new to the code base and need a lot of help to actually make this a reality. And helping us review the patch would be awesome, if anybody is willing to do that. That's it. Thank you. Any questions? Do we have time? Five minutes. Are you all completely stunned by that? Oh, not everybody, evidently. Yeah. So the question is: can you say something about your write barrier implementation? You said that it's always a performance win, but if the stack allocation fails, then presumably you've got a more expensive write barrier now. That's right. And that's exactly where I was going; I actually have two backup slides where I talk about the reference-to-reference issue, which leads me to the stack write barrier. We currently remove the write barriers on stack-allocated objects, because we are sure that when we make something a candidate for stack allocation, it will never become part of something else. So if it has a field and you write into that field, you don't need a write barrier, because nobody's ever going to see that object; it lives on the stack. Now, the reason why we can't do stack-allocated-to-stack-allocated references is exactly this case. Take this example here, where we have two objects pointing to each other. We get to the bottom part, we load the original test object from the wrapper, we do t.x, everything's good, we remove the write barrier. Now, what if there was another assignment in between, like this, and it actually gave us a heap object?
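My reconstruction of the Q&A example (the class and field names are made up, and I've made the inner field a reference so a card-marking barrier would actually apply to the store): a wrapper holds a reference to another object, and an assignment in between can replace it with a heap object, so at the final store the compiler can no longer prove a removed write barrier was safe:

```java
public class BarrierExample {
    static class Test { Object x; }
    static class Wrapper { Test t; }

    public static Object use(boolean cond, Test heapT, Object payload) {
        Wrapper w = new Wrapper();  // stack allocation candidate
        Test t = new Test();        // stack allocation candidate
        w.t = t;                    // reference store: no barrier needed if w is on the stack
        if (cond)
            w.t = heapT;            // a heap object stored in between
        Test t1 = w.t;              // either the stack object or the heap object
        t1.x = payload;             // reference store: needs a card mark iff t1 is on the heap
        return t1.x;
    }
}
```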
Now, coming down here, it's either a heap or a stack object at this t1.x, so we don't know. In that case, we can either detect this with analysis and reject it, which would reject certain candidates, or we extend the write barrier to actually look at the stack range and say, yeah, this falls into the stack, you're good, keep going, don't worry about it. So it would increase the cost of the write barrier if we did that approach. Very interesting results, thanks a lot. A quick question: have you considered allocating such objects on the heap instead of on the stack? Do you mean reserving a special heap region for this kind of logic? Like with TLABs: you're allocating TLABs, but just for the duration of a single call, and you chunk them. That way, you can significantly simplify the requirements on the runtime; you don't need to treat special objects on the stack, since everything stays on the heap. Well, I'd have to consider how that would work. I can't think it through off the top of my head, but yeah, we'll think about it. Maybe there's a way we can actually do that; we didn't consider it. I think there was a question here. Just looking forward to trying out the patch when it comes out, because in the memory API that I showed before, we had some benchmarks that were very problematic and stressed the flow-sensitive case that you were mentioning before. Yeah, especially with the immutability trend, and I love immutable objects myself. It actually creates copies every time, which we're hoping this would take care of. Charlie? So I just want to be clear that there's still a limitation that there's no partial escape analysis here, no lazily standing objects back up. Yeah, absolutely. We can look into that next. Right now we're not doing it, so anytime we see something that escapes, for us it escapes. It doesn't matter if it's a cold call or something that's never reached. Okay. Okay, well, I think we're done. Thank you.
Thanks.