Hi, I'm Roman, I work for Red Hat, and I want to talk a little bit about what we came to call Shenandoah GC 2.0. First of all, we don't really have a versioning scheme, so we had Shenandoah 1.0 a little bit more than a year ago, which landed in JDK 12, and then we had some ideas back at the last FOSDEM that developed into a significant change in the GC algorithm, which we came to call Shenandoah GC 2.0. So what's happening? I want to give you a little bit of an overview of what Shenandoah is and how it works, and then I want to go into the old versus the new barrier scheme that we are using, the new platforms that we are supporting, then the elimination of the forwarding pointer and how that works, then what we call the self-fixing barriers, and, okay, forget about concurrent class unloading and all this stuff, I don't think I have time for this.

So what is Shenandoah? Shenandoah is a garbage collector in OpenJDK, also called a memory manager, whichever term you want to use. Its speciality is that it's a concurrent GC, which means that it does all of its work concurrently with the running program. It is universal in several aspects: one is that it doesn't require any special support from the operating system, which makes it very easy to port to new OSs, and it doesn't require any special thing from the architectures, like, I don't know, special pointer support or something; the only thing that it requires is a compare-and-swap, which, you know, every architecture has by now. It also means it's easy to use, so you don't need to learn about dozens of command line flags to configure this GC. It should be very usable, and the aim is to solve the GC pause problem, so what we achieve is GC pauses in the range of a millisecond or so. Regardless of heap size, it works with small heaps or large heaps, regardless of slow or tiny hardware, and it can work very well in constrained environments, et cetera, et cetera.
You may have heard from us that Shenandoah requires more memory than other garbage collectors, that the barriers are rather complicated and hard to optimize, and that we actually lacked some optimizations that would have been possible with other GCs. That is no longer true, and I want to show you why. To put this into perspective, the landscape of garbage collectors in OpenJDK is like this. We have the serial GC and the parallel GC, and they do all of the garbage collection work while the world is stopped; this is this orange block here. So the young collections are stop-the-world, and the old-gen collections are also stop-the-world. This makes it easy to do and very performant, so if you want a high-performance GC, take the parallel GC, or maybe serial if you run in, like, a cloud environment or something. We have CMS, which still does the young collection while the world is stopped, but this is usually very short. The old-gen collection, however, is done concurrently. The gotcha here is that it's not a compacting GC, which means that it tends to fragment the heap over time, which means it eventually needs a full collection at some point, which is stop-the-world and takes a pause. We have G1, which does the marking phase concurrently but then has a stop-the-world compaction phase, and the strength here is that with G1 this can be adjusted so that the pause time is sort of under control, but it can still do some damage if the heuristics fail or something like that. We'll talk about G1 later, I think. And the new GCs, Shenandoah and also ZGC, don't have a young generation; the whole heap is collected concurrently, except for a few tiny pauses here and here. We have concurrent marking and concurrent compaction, which means this solves the fragmentation problem that we had with CMS. The concurrent marking I'm not going to cover, because I don't have enough time. If you want, you can look it up in the Garbage Collection Handbook or online.
You will find more information in talks that Aleksey did, for example. Sorry. Concurrent compaction, how does it work? It's easy if it's not concurrent, right? When the world is stopped, you simply copy all the reachable objects that are in the collection set to empty regions, then update all the references that point to the old copies to point to the new copies, and then you're done. This is easy while the world is stopped, while no Java program is running, but how do you do that while the Java program is running? The problem is that as soon as you copy an object, you have two copies of the object, and you need to make sure that it stays consistent, right? So you start out with all the pointers pointing to the old object. We make a copy of this guy. At this point, we have two copies of the same object. So now what happens when one thread is updating one copy here and another thread is updating the other copy there? Which one is correct now, right? This is the basic problem that we solve. And the solution in Shenandoah is that we put an indirection here. Whenever we want to write something, we make sure that we write into the correct copy, which is the new copy, the to-space copy. And this is done by a barrier that first copies the object if necessary and then uses a compare-and-swap to update this forwarding pointer to point to the new copy. Or in the case of reads, this is the old scheme, right? So in Shenandoah 1.0, we had an extra forwarding pointer here, and it either pointed to itself when there was no forwarded copy, or it pointed to the new copy of the object, in which case we can resolve it and do the writes there, and read from there if it actually exists. So this required a whole zoo of barriers in order to work correctly. We needed read barriers, which means that whenever you access a field or an array element, the compiler inserts some extra code that makes sure that we resolve the forwarding pointer of this guy here.
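As a rough illustration of the 1.0 barrier idea described above, here is a hedged Java sketch. All class and method names are made up for illustration; the real barrier is machine code emitted by the JIT, not library calls, and it only fires for objects in the collection set. The point is the shape of the mechanism: each object carries a forwarding pointer that initially points to itself, and the write barrier copies the object if necessary and compare-and-swaps the forwarding pointer before the write happens.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model of a Shenandoah 1.0 object: a forwarding pointer
// that points to itself until the object is evacuated.
class ForwardedObject {
    final AtomicReference<ForwardedObject> forwardee = new AtomicReference<>();
    int field;

    ForwardedObject(int field) {
        this.field = field;
        this.forwardee.set(this); // points to itself: not yet evacuated
    }

    // Write barrier: establish the canonical (to-space) copy before writing.
    // In the real GC this only runs for objects in the collection set;
    // here we always "evacuate" to keep the sketch small.
    ForwardedObject writeBarrier() {
        ForwardedObject fwd = forwardee.get();
        if (fwd != this) {
            return fwd; // already evacuated; write into the new copy
        }
        ForwardedObject copy = new ForwardedObject(this.field);
        // Only one thread wins the race to install the copy.
        if (forwardee.compareAndSet(this, copy)) {
            return copy;
        }
        return forwardee.get(); // another thread evacuated it first
    }
}

public class WriteBarrierDemo {
    public static void main(String[] args) {
        ForwardedObject obj = new ForwardedObject(1);
        ForwardedObject canonical = obj.writeBarrier();
        canonical.field = 42; // the write lands in the to-space copy
        System.out.println(obj.writeBarrier().field); // prints 42
    }
}
```

The compare-and-swap is what makes two racing threads agree on a single new copy: whichever thread loses the race throws its copy away and uses the winner's.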
And we needed write barriers for establishing the canonical copy of this object, otherwise we would have inconsistencies. Both of these barriers were required on both object reads and writes and primitive reads and writes, which means we have a lot of them, and this can cause a lot of performance trouble. I will show you later how we could work around this a little bit. We also required these so-called object-equals barriers, because if you have two copies of the same object, you must make sure that an obj1 == obj2 operation doesn't get a false negative: it would see that the addresses are different and return false even though it's the same object. We required the compare-and-swap barriers. For an unsafe compare-and-swap, it's basically the combined problem of all of the above: the value in memory could be in a different copy than the one that we are comparing against. So we needed to handle this object-equals problem, and we also needed to ensure that the compare-and-swap writes to the correct copy. And we needed the array copy barriers; this is a special barrier for bulk array access, basically a combination of read and write barriers. For example, if you have code like this. This is a field access here: we would read this field foo, call the result object a, and then we'd have a hot loop in which we access another field inside this object foo, like x; this is a read access. We also want to write something to the field y in the object foo; this is the write access. So what the compiler would do is insert barriers here: first a read barrier to ensure we read from the correct copy of this object foo, and also a write barrier because we want to write to this object here. So we have two barriers already, and they're running inside a hot loop, so the performance impact of this can be rather dramatic, I think. And this was required for primitive fields too. So we needed a lot of them.
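The hot-loop example above might look roughly like this in Java. The classes here are invented to mirror the shape of the example from the talk; the barrier calls shown in comments are where the 1.0 JIT would insert them, once per iteration, before any optimization.

```java
// Hypothetical classes mirroring the talk's example: a loop that reads
// foo.x and writes foo.y on every iteration.
class Foo { int x; int y; }
class Holder { Foo foo = new Foo(); }

public class BarrierPlacementDemo {
    static void hotLoop(Holder h, int n) {
        for (int i = 0; i < n; i++) {
            // Old scheme (1.0): the JIT inserted a read barrier before
            // reading h.foo and a write barrier before writing through it,
            // inside every iteration, conceptually:
            //   a = readBarrier(h).foo;   // resolve the forwarding pointer
            //   a = writeBarrier(a);      // establish the canonical copy
            int tmp = h.foo.x;   // the read access
            h.foo.y = tmp + i;   // the write access
        }
    }

    public static void main(String[] args) {
        Holder h = new Holder();
        h.foo.x = 5;
        hotLoop(h, 3);
        System.out.println(h.foo.y); // last iteration writes 5 + 2 = 7
    }
}
```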
We have been able, by some JIT heroics, to optimize this a little bit, right? We could say, yeah, okay, this foo thing here is a loop invariant, so we can hoist it outside the loop. This is what we do here. So, yeah, performance problem mostly solved. And then we also know that the write barrier is stronger than the read barrier, so we could coalesce these into a single barrier. But if you look at this simple example, the idea that we had was: why not simply emit the barrier right after the read here, right? You could do this much more easily, and this is what we did with the new scheme. We call this the load reference barrier because it's inserted whenever a reference is loaded. It's not before this access here, it's after this load here. That is why it's called a load reference barrier. Right after the reference is loaded, we establish the canonical copy. It's pretty much like the write barrier; it works much the same. And then this copy is used for everything else. Naturally, this placement is much better, because the JIT compiler doesn't have to do all this work of moving barriers around to get good performance out of this. So here, this goes away. Any copy that we have inside the JVM is the canonical copy of an object right after this load. This makes it much easier to reason about which object is which, because we only have one. The write barriers are turned into the new load reference barriers, which work pretty much the same, by copying and compare-and-swapping the forwarding pointer. We don't need the object-equals barriers anymore, because we only have the canonical copy inside the JVM. We still need the compare-and-swap barrier, because it's still the same problem in the memory location: we could still have a pointer to the old copy as long as we haven't updated it. But it's simpler than before. Not on the slide, I forgot, are the array copy barriers.
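The new placement can be sketched like this. `loadReferenceBarrier` here is a made-up stand-in (the identity function when nothing is being evacuated); in reality it's emitted by the JIT right after every reference load. The point is that the barrier fires once at the load, and everything downstream already holds the canonical copy.

```java
// Hedged sketch of the new barrier placement, with invented names.
class Cell { int x; int y; }

public class LoadReferenceBarrierDemo {
    static <T> T loadReferenceBarrier(T ref) {
        // Real barrier: if ref is in the collection set, return (or create
        // and install) the to-space copy; otherwise return ref unchanged.
        // Here it is the identity, modeling the "nothing to do" case.
        return ref;
    }

    static int hotLoop(Cell loaded, int n) {
        // The barrier fires once, right where the reference was loaded...
        Cell c = loadReferenceBarrier(loaded);
        for (int i = 0; i < n; i++) {
            // ...so inside the loop no per-access barriers are needed:
            // c is already known to be the canonical copy.
            c.y = c.x + i;
        }
        return c.y;
    }

    public static void main(String[] args) {
        Cell cell = new Cell();
        cell.x = 5;
        System.out.println(hotLoop(cell, 3)); // 5 + 2 = 7
    }
}
```

Compare this with the previous sketch: the hoisting that needed JIT heroics before now falls out of the barrier placement for free.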
We still need those too, because array copy is effectively a series of loads and stores, so we need to insert this load reference barrier there too. But it's much simpler than before, plus it's not needed anymore for primitive fields, which makes it much, much less frequent than it was before. But we can do more with this. So this is much simpler. I already said that we now have a strong invariant, which, as opposed to the weak invariant that we had before (I think I explain that later), means much less frequent barriers. It's much simpler to optimize; we have been able to throw away a lot of code from our compiler optimizations just because of this. Yeah, so this simpler barrier scheme makes porting much easier; of course, we have many fewer barriers to implement on a new platform, for example. This strong invariant and this much lower frequency of the barriers mean that we can eliminate the forwarding pointer. I will talk about this later. So first let's start with the architectures here, the porting. In Shenandoah 1.0, we had support for x86, 64-bit. This is our primary target. I copied this from our wiki; you can also look it up there. The new one here is the 32-bit support. With this new scheme, this was much easier to implement, and Aleksey did it in a couple of days, I think, deriving from this port. Being able to run on 32 bits also implies that we are able to run with compressed oops, which is a nice side effect. We have support for AArch64; that has always been the case. AArch32 support is in development; if anybody wants to help here, you're welcome. We don't have support for any of the other architectures, basically because nobody has asked for it or nobody has done it yet. So if you are interested in that, please, contributions are very welcome. Operating systems: we have Linux, of course. It's our primary target, we work with it all the time, so it is continuously tested.
We have Windows as the secondary target; it's continuously tested in CI. We have macOS support and Solaris support, which are basically maintained by the community. Operating systems, as I said before, are very easy: we have zero operating-system-specific code in Shenandoah, which makes it very easy to port to new operating systems. It's mostly a matter of making the tool chains happy and compiling the code. If you have any needs here, contact us, and it should not be difficult. So, the new forwarding pointer scheme. This is how it worked before: we had the object with its header and a couple of fields, and we used to require one additional word that we kind of stuck in before the actual object, which meant that we required more memory per object in earlier versions of Shenandoah. But it was easy to implement: if we wanted to arrive at the correct copy, we simply loaded this forwarding pointer, and we would end up either at this same copy or at the new copy, and then we do the work there. This was easy. And it was necessary because of the invariant thing that I want to explain. With 1.0 we had this weak invariant, which means that all reads may read from the old copy if no canonical copy has been established; this is okay by the Java memory model. However, all writes must write to the new copy, for consistency. With the new scheme, we have a so-called strong invariant, which means that all reads must read from the new copy and all writes must write to the new copy. What this implies is that the old copy is 100% unused as soon as we have established the new one. And we can use this old copy to keep the forwarding information: instead of having an extra word here, we can just stick it in somewhere inside the old copy. We chose to do this in the header, because it's easy; it's at offset zero. And this requires much less memory.
And the trouble with making this work is that we need to distinguish whether this is a valid object with a valid header or a forwarding pointer. In order to explain how this works, I want to show you how the load reference barrier is structured. First of all, it happens after an object load, which is here; we load from this address. If the object is in the collection set, which means it's targeted for evacuation, then we first look at whether we already have a forwarded copy: we decode this forwarding pointer, and if we don't, if the object is still the same, we take the slow path, which means we copy the object, we do the compare-and-swap, and then we have the canonical copy here. And this is the thing that does the forwarding pointer decoding here. Before Shenandoah 2.0, it was easy: we would simply read from the forwarding pointer word that we stuck onto the object. Now it's a little bit more complicated. First we read the header, we treat it as an int, and then we do some masking. The way this works is that the header is a bit overloaded with a couple of meanings: one part of it is locking bits, some other part is the hash code, which is also used for a couple of GC things. And objects in the JVM are always 64-bit aligned, which means that the lowest three bits of any object pointer are zero. So we can say, okay, if in the lower bits we have this bit pattern, the lowest two bits set, then we say it's forwarded. This happens to be an unused bit combination for the locking bits, so we can do this; actually other GCs do the same, they give it the same meaning. So if we have the lowest two bits set, it's forwarded, and we treat the upper bits of this header as the pointer that points to the new copy. Otherwise, if they're not set, it's still the original copy, and we simply return that. So this is a bit more complicated than it was before. We can afford to do that because it's much less frequent.
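The decode step above can be sketched with plain bit arithmetic. The tag value and masks below are illustrative, not the exact HotSpot mark-word encoding: the only properties the sketch relies on are the ones from the talk, namely that real object pointers are 64-bit aligned (so their low bits are zero) and that both low bits set is otherwise unused.

```java
// Hedged sketch of forwarding-pointer decoding from the header word.
public class ForwardingDecodeDemo {
    static final long FORWARDED = 0b11;   // low two bits set => forwarded
    static final long PTR_MASK = ~0b11L;  // clear the tag bits

    static boolean isForwarded(long header) {
        return (header & 0b11) == FORWARDED;
    }

    // If forwarded, the upper bits of the header are the to-space address;
    // otherwise this object is still the original copy, so return its own
    // address.
    static long decode(long header, long selfAddress) {
        return isForwarded(header) ? (header & PTR_MASK) : selfAddress;
    }

    public static void main(String[] args) {
        long self = 0x1000;                  // pretend address of the object
        long plainHeader = 0x5A5A00L;        // ordinary header: low bits clear
        long fwdHeader = 0x2000 | FORWARDED; // forwarded to address 0x2000

        System.out.println(Long.toHexString(decode(plainHeader, self))); // 1000
        System.out.println(Long.toHexString(decode(fwdHeader, self)));   // 2000
    }
}
```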
This doesn't happen on primitive reads at all, for example; it only happens in the mid-path of the new load reference barrier, so yeah, we can do this. This means it reduces the memory footprint per object compared to the old Shenandoah: in the best case, down to 66%, because an empty object is only two words, and with the extra word it was three words. Realistically, we end up at something like 75% to 95% of the old footprint. Or, put differently, we use the same amount of memory as any other GC, which makes it a much better fit. This also means it reduces allocation pressure, or, put another way, you can allocate more memory in the same amount of time. It also means that we have to run fewer GC cycles for the same amount of allocation, which translates into better CPU usage. So, I'm not sure I can do this in five minutes, but I'll try: self-fixing barriers. If you look at this again, it's the same code that we had before. The thing is, it's not a problem, but we have this very infrequent case where the object is in the collection set and not yet forwarded, and then we run into this slow path. This is a very infrequent case; it doesn't happen very often. We do quite a bit of work here, and that's okay. But this case where we already have a forwarded copy while the field here is not updated yet is actually quite frequent, because the fields are only updated much later. So the idea here is: while we're here, why not do all this work already? Why not also update this field? Then the next time we come here, we see, okay, it's not in the collection set anymore, and we can fall through. And yeah, this is what we did. It's basically here, right? If we already have a forwarded copy, we say, okay, update this field so that it now carries the forwarded copy. The next time you come here, this is basically a no-op, and we jump right through, and this makes it much less frequent to even go there. So, yeah, availability: the 1.0 stuff landed in JDK 12.
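Going back to the self-fixing barrier for a moment, the idea can be sketched like this. Names and structure are invented for illustration (the real barrier works on raw memory inside HotSpot): when the barrier resolves a reference that is already forwarded, it also compare-and-swaps the field it was loaded from, so the next load of that field sees the new copy and skips the barrier's mid-path entirely.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hedged model of a heap object: forwardee is null until evacuated.
class Node {
    volatile Node forwardee;
    int value;
    Node(int v) { value = v; }
}

public class SelfFixingBarrierDemo {
    // field is the location the reference was loaded from; if the loaded
    // reference is stale, fix the field in place (the "self-fixing" step).
    static Node loadReferenceBarrier(AtomicReference<Node> field) {
        Node ref = field.get();
        Node fwd = ref.forwardee;
        if (fwd != null) {
            // Self-fixing: CAS the stale field over to the new copy, so the
            // next load from this field needs no forwarding work at all.
            field.compareAndSet(ref, fwd);
            return fwd;
        }
        return ref;
    }

    public static void main(String[] args) {
        Node oldCopy = new Node(7);
        Node newCopy = new Node(7);
        oldCopy.forwardee = newCopy; // already evacuated by some other thread
        AtomicReference<Node> field = new AtomicReference<>(oldCopy);

        Node seen = loadReferenceBarrier(field);
        System.out.println(seen == newCopy);        // true
        System.out.println(field.get() == newCopy); // true: field was fixed
    }
}
```

The CAS (rather than a plain store) matters because the program may concurrently write a completely different reference into the field; the barrier must only fix the field if it still holds the stale copy.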
Most of the new stuff landed in JDK 13, except for the self-fixing barriers, which go into JDK 14. We do backports into our own downstream repository under Shenandoah JDK 11, and we also backport all the stuff all the way to JDK 8. Both of those are used for the Fedora and Red Hat Enterprise Linux packages and all the derivatives of those. We are currently in the process of upstreaming this port to JDK 11 updates proper; it's under review. We also intend to upstream the JDK 8 bits to JDK 8 updates proper soon. This is made possible by this new barrier scheme, because it meant much less of a mess that we needed to make in order to make it work in JDK 11 and JDK 8. What else? I'm basically done. So, if you want more information about Shenandoah, we have a wiki here; there's all sorts of interesting information there. There's a mailing list; you can send any bug reports or questions or whatever there. Or you can follow me and Aleksey on Twitter, where we usually announce stuff too and discuss all sorts of interesting details. Yeah, thank you. Any questions? Do we have time left for questions? Two minutes, okay. Questions? Yes? You were talking mostly about moving objects around; I'm wondering about the garbage collection itself, marking which objects are alive. This seems pretty hard to do concurrently; don't you need to stop the world for it? How do you solve this concurrently? The concurrent marking? Can you repeat the question? The question was how we do the concurrent marking. This would take another 20 minutes to explain, I guess. I cannot do this here, so you can look it up in talks by Aleksey, for example. There's a technique called snapshot-at-the-beginning that requires more barriers; that's the short answer. Can you comment a little bit on realistic heap sizes that Shenandoah can deal with? Realistic?
400 megabytes up to, I don't know, we tested it on two terabytes, I think. It doesn't really matter; it doesn't depend on heap size anymore. Thank you. I guess you need some extra work for your load reference barriers when starting the compaction phase. Can you talk about that? How much impact is there? I think you need to scan the stacks and update all the references. Yes, the stacks and all the other GC roots at the beginning of the compaction phase. This is basically a GC roots scan; we also pre-evacuate and update all those references there to establish this invariant. Does that answer your question? How expensive? It's not that expensive. As I said, we can do pause times in the range of one millisecond; it depends on the stack sizes, basically. You showed a pause at the beginning of the mark and a pause at the beginning of the compaction. Are there any plans to improve those pauses? Can you say it again? I didn't quite understand what the question was. You showed a pause at the start of the mark and a pause at the start of the compaction; are there any plans to improve the performance of those pauses? Yes, we have some development in JDK 14, which I covered on the first slide. We can now do concurrent class unloading and concurrent root scanning for some of the roots. This landed in JDK 14. To cover this, I would need another 20 minutes, and I think we are done now. But yes, we are in the process of doing this. Thanks, everyone. Thank you.