Hi, my name is Stefan Johansson. I work at Oracle on the HotSpot GC team. I've been working with GC for the last seven years, but I've been at Oracle for almost 14 years now, working close to the JVM the whole time. My focus in GC has mostly been G1, and that's what I'm going to talk to you about today.

If we go back some years, to when JDK 8 was released, G1 was kind of the new cool GC, getting a lot of buzz, and by some maybe being perceived a bit like Buzz Lightyear in Toy Story. Hence the title. This is not true anymore. Instead, we have other new cool GCs; we've heard about one today that gets most of the buzz. And G1 has matured into a well-performing and stable default garbage collector. The goal of this talk is to show this progress in G1, and by doing so hopefully win over some of you still on JDK 8 to move to newer JDK releases, 11 or maybe 14.

Before we get started, the agenda for today; you've seen this before. I'm going to start off with a very short introduction to GC as a concept, then G1 in a very quick fashion, and then focus mostly on the progress we've made in G1 since JDK 8. Hopefully at the end I'll have some time for a short glimpse at the future.

So, GC in OpenJDK. Garbage collection is not only about collecting garbage. Before we can do that, we need to hand out the memory, and having a fast and efficient allocation algorithm is very important. This is something that's done fairly equally by all the garbage collectors in OpenJDK. What we have is something called TLABs, thread-local allocation buffers, so that when a Java thread needs to allocate memory for an object, it basically just has to bump a pointer and fill out the object. There's a little more to it, but that's the gist of it. Then, when we have memory that's not used anymore, the garbage collection kicks in, and this is of course the most important part of the garbage collection algorithms.
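As a rough sketch of that bump-pointer idea, here is what a TLAB fast path looks like conceptually. This is purely illustrative: the class and names are invented, and the real allocation path lives inside the HotSpot VM, not in Java code.

```java
// Illustrative model of a thread-local allocation buffer (TLAB).
// Allocation on the fast path is just a bounds check plus a pointer bump.
final class Tlab {
    private final byte[] buffer; // stands in for a chunk of heap memory
    private int top = 0;         // next free offset, the "bump pointer"

    Tlab(int sizeBytes) {
        buffer = new byte[sizeBytes];
    }

    /** Returns the offset of the new object, or -1 if the TLAB is full. */
    int allocate(int objectSize) {
        if (top + objectSize > buffer.length) {
            return -1; // slow path: the thread would request a fresh TLAB
        }
        int obj = top;
        top += objectSize; // the fast path is just this pointer bump
        return obj;
    }
}
```

Because each thread has its own buffer, this fast path needs no synchronization; threads only synchronize when they grab a new TLAB from the shared heap.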
We have quite a few of them in OpenJDK; as I mentioned, you've heard about Shenandoah already. When you design a garbage collection algorithm, you have to take a few different concepts into consideration and make trade-offs between them. There are many concepts, but the three I'm focusing on today are throughput, latency, and footprint. By throughput I basically mean the number of operations you can complete in a certain amount of time, while latency is the time one operation takes to complete. So if you have a long GC pause, that will affect your latency. Footprint is basically the resource overhead caused by a garbage collection algorithm; for example, many of these algorithms need extra memory to be able to do the garbage collection. I'm going to focus on memory overhead today.

So, the current collectors in OpenJDK. I actually included CMS as well, even though it's been removed in JDK 14. I'm sorry for those of you who had CMS as your favorite, but there are good alternatives that we can use going forward.

A few words about the collectors still around. We have the Serial collector, a very basic and easy-to-understand collector, whose main feature is its very low memory overhead. If you, for example, run something like a function in the cloud, it can be a really good choice, because you might not do very much garbage collection. The Parallel GC was the default up until JDK 8. It's very throughput oriented, focusing on giving the best possible throughput for Java.
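For reference, these are the standard HotSpot flags for selecting the collectors discussed here (`MyApp` is just a placeholder; in the JDK 14 time frame, ZGC and Shenandoah also require unlocking experimental options):

```shell
java -XX:+UseSerialGC   MyApp   # Serial: lowest memory overhead
java -XX:+UseParallelGC MyApp   # Parallel: throughput oriented, default up to JDK 8
java -XX:+UseG1GC       MyApp   # G1: balanced, the default since JDK 9

# The experimental low-latency collectors:
java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC          MyApp
java -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC MyApp
```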
It also has very good average pause times, but the worst-case latencies can be bad, because if you run with Parallel for a long time, in many cases you eventually run into a full collection, and those take a lot of time. That's where G1 comes in, the new default collector since JDK 9. We try to have balanced performance, still providing good throughput but also caring about the latencies. How we do this I will show you a bit later. The new and cool GCs focusing on low latency, ZGC and Shenandoah, are still experimental, but both teams are working hard to make them fully supported. You've heard a lot about Shenandoah already; for ZGC, the main goal is to provide sub-millisecond pause times. So we're looking forward to those becoming fully supported.

That was it about GC, so let's move on to G1. As I mentioned, the basic idea here is to provide a balance between latency and throughput. To be able to do this, we have two big concepts in G1. Of course there are more things needed, but the two big ones are that it's region based and that we have concurrent marking. What we mean by region based is that we divide the heap into many heap regions. For example, if you have a 10-gigabyte heap, we try to have around 2,000 regions. That's hard to hit exactly, because we want a power-of-two region size, so we'll end up with a few more regions and use a four-megabyte region size. Those regions can be used for both the young and the old generation. G1 is still generational, with two generations, but these generations are not big contiguous chunks of memory; they are sets of regions instead. What this gives us is that we can collect a few old regions at a time. We don't have to collect all old regions at once, which is the case for Serial and Parallel. But to be able to do this, we also need to know what's live in those old regions, and that's where the concurrent marking comes in, as was mentioned previously.
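The region sizing just described can be sketched like this. This is a simplified, hypothetical version of the ergonomic: aim for roughly 2,048 regions, round down to a power of two, and clamp to the 1 MB to 32 MB range HotSpot uses (the real code also honors flags such as `-XX:G1HeapRegionSize`).

```java
// Simplified sketch of G1's default region-size ergonomics.
final class RegionSize {
    static final long MB = 1024 * 1024;

    static long pick(long heapBytes) {
        long raw = heapBytes / 2048;                       // aim for ~2048 regions
        long pow2 = Long.highestOneBit(Math.max(raw, 1));  // round down to a power of two
        return Math.min(Math.max(pow2, 1 * MB), 32 * MB);  // clamp to 1 MB .. 32 MB
    }
}
```

For the 10-gigabyte heap from the talk this picks 10 GB / 2048 ≈ 5 MB, rounded down to 4 MB, which gives 2,560 regions: "a few more" than 2,048, exactly as described.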
I won't have time to explain how the concurrent marking works, but look it up. Basically, once you have the marking information in place and you have a region-based collector, you're able to collect a few old regions at a time. We call these mixed collections, and by doing so we are in many cases able to avoid the long and costly full collections.

Another thing with G1, which might have been a problem in the past but which we're working hard on making true, is that we want it to be easy to tune. You shouldn't have to set a lot of different flags to change the behavior. So we have one main tuning knob: the pause time goal. If you want to increase throughput, you increase this value; if you want better latencies, you turn it down. The default value is 200 milliseconds, which has shown itself to be a pretty good default, giving a balance between latency and throughput. But your application might have different needs, so tuning this should be the first thing you try if you want to change the behavior of G1.

So, the current status of G1. It was added in JDK 6, but full support came in JDK 7u4. After that, we worked really hard on making G1 sort of complete, adding some really necessary features in the JDK 8 time frame and also improving the performance a lot. One such feature is class unloading at the concurrent mark, so you don't have to rely on a full collection to be able to do class unloading, because G1 tries to avoid full collections. In practice, you would never do class unloading in JDK 8 without the class unloading at the concurrent mark. A lot of other features made it more stable and more mature, and in JDK 9 we decided it was time to make G1 the default collector. That was a somewhat controversial decision, because people thought that Parallel has better throughput, but we saw that having a balance between latency and throughput is very important, and the fact is that we were also working really hard
on improving G1, and we want the users to benefit from all those improvements without having to switch GCs. So making it the default in 9 was, I think, a good decision.

Yeah, that's the background. Let's look at the progress, what we've done since JDK 8. We've done a lot: around 700 enhancements to G1 since JDK 8. Some of those are big features, but a lot of them are just small enhancements improving small parts of the garbage collection algorithm. Together they show some really significant improvements, and we see those across all areas. It's not like we only improved latency or only improved footprint; we managed to improve all areas. The way we've been able to do this is that we've fixed old inefficiencies and in some cases been able to cut away trade-offs that were made in the early days of G1.

So, throughput. One of the big things we've done to improve the throughput of G1 is improved NUMA awareness. G1 has always had very basic NUMA support, since the Java heap itself has basic NUMA support, but what we've done in the latest release is that we now actively try to allocate Java memory on the local NUMA node, giving better performance. The same goes for the GC: it tries to keep the memory on the same NUMA node it was allocated on. We still have more ideas and more work to do in this area, but these things really showed some good improvements.

We also spent quite a significant amount of time on making the concurrent work more efficient. The important thing here is to try to keep the GC out of the way, making sure the Java application can use as many resources as possible, or all the resources in the best case. This can mean making the marking more efficient in itself, but it can also mean delaying the marking, or making sure that instead of running five marking cycles we only run two, because we can still avoid the full collections. The work we've done there is also proving to be
really important for throughput.

We also added a parallel full collection to G1. This can be seen as both a throughput and a latency thing, but if you want to tune G1 to be more throughput oriented, you might be able to take the hit of a few full GCs as long as they don't take an extremely long time. So having a parallel full GC that works similarly to the other full GCs out there is really important if you want to tune G1 to work well in batch-style scenarios.

So let's take a look at some numbers from the throughput improvements. I'm using SPECjbb2015 here, and the results we're looking at are the throughput metric from SPECjbb, if you're familiar with this benchmark. It's run with a 16-gigabyte heap, and I normalized the score against JDK 8 with Parallel, because that was the default back in JDK 8. As we can see, in JDK 8 G1 was behind, but we've been able to close this gap: in JDK 11 and 14 we're around 10% better when it comes to throughput performance. This is of course not only GC improvements; the whole Java platform has been made more efficient and performs better. But having the GC, G1 especially, keep more out of the way has really helped improve this performance. Letting Java run on the CPUs instead of the GC running on the CPUs really helps here.

Yeah, let's move over to latency. This is an area where most, or at least a lot, of the enhancements have gone in. We improved the parallelism in a lot of the different GC phases, making sure that even though a phase seems pretty small, it runs in parallel and takes as short a time as possible, to keep the pauses as short as possible. We also worked very hard on making those phases more efficient. For example, reference processing, that is, the processing of java.lang.ref references: that phase has been improved both when it comes to parallelism and the efficiency of the processing itself. So if you have
an application where you've had problems with reference processing in the past, it might be a good idea to check out the later releases.

We also improved a lot on pause time predictions. What I mean by that is basically that G1 tries to predict the number of regions it can collect while keeping the pause time target set by the user. If those predictions are bad, we might take too many regions and then not be able to keep the pause time target. So working on this is really important if you want a predictable latency story.

Another part of this predictable latency story is the handling of the mixed collections. As I mentioned before, mixed collections are the collections where we collect a set of old regions. Those are a bit harder to predict, because we don't do them as often and they have different characteristics from the young regions. What we do here now is that instead of, like before, selecting a set of old regions that we have to complete, we select a set of old regions that we try to complete, and then we collect as many as we can until the pause time budget is spent, so that we don't go over the pause time target if possible. In some cases we still do, but this is a good improvement towards always keeping the pause time goal that we have.

Yeah, let's look at some results here. Once again SPECjbb2015 results, but this time we're looking at the throughput metric with latency requirements. These are basically still throughput scores, but they are affected a lot by the latency provided by the Java platform. Again, it's the same benchmark setup.
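The "collect as many as we can within the budget" idea can be sketched as a simple greedy loop. The names and shape here are invented for illustration; the real G1 predicts per-region costs from the history of past collections and keeps extra safety margins.

```java
// Illustrative greedy collection-set selection for mixed collections:
// add candidate old regions while the predicted pause stays in budget.
final class CollectionSetChooser {
    /**
     * @param predictedMs predicted evacuation time for each candidate old region
     * @param budgetMs    pause-time budget left after the mandatory young regions
     * @return the number of old regions to take in this mixed collection
     */
    static int chooseOldRegions(double[] predictedMs, double budgetMs) {
        double total = 0;
        int taken = 0;
        for (double cost : predictedMs) {
            if (total + cost > budgetMs) {
                break; // taking this region would overrun the pause target
            }
            total += cost;
            taken++;
        }
        return taken;
    }
}
```

The key difference from older G1 behavior, as described in the talk, is that the regions beyond the mandatory set are optional: the loop simply stops when the budget runs out instead of committing to a fixed set up front.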
So it's a 16-gigabyte heap. We see here that G1 in JDK 8 is not very impressive, but we've done a lot of work. In JDK 11 we're around 10 to 15% up, I think, but the big thing comes in JDK 14, where we're more than 40% better than Parallel in JDK 8, and even more if you compare to G1 itself. The cool thing here, to me at least, is the last bar. There I set the pause time target to 50 milliseconds instead of the default 200 milliseconds, and by doing so, you can see that I improved the latency score compared to the default. This of course comes with a throughput cost, but if your main goal is to have good latencies, it's very easy to improve them just by tuning the pause time goal. Another thing to mention: the average pause time in JDK 8 for G1 is around 160 milliseconds, so we're still below the pause time target. In JDK 14 it's down to 100 milliseconds, so we've made some quite significant improvements here, and that's really nice to see.

The last thing I want to talk about is footprint, and as I mentioned, this is memory footprint. Something we've heard a lot when it comes to G1 is that the remembered sets take up way too much space, and that was true back in the day. The remembered sets are the data structures G1 needs to be able to collect a region. All young regions always have remembered sets; in the past, old regions also always had remembered sets. But we only collect old regions after the concurrent mark cycles, so for much of the application run, we kept those remembered sets around without needing them, basically. What we realized was that if we can rebuild those remembered sets during the concurrent cycle, and only have them around when we actually need them, that should provide a much better user experience. I'll show you in the next slide how much this gave us. We also improved the sizing ergonomics a lot since the JDK 8 time frame. This is basically making sure that we size the remembered set data structures
correctly with regard to the region size and things like that, making sure they follow a good pattern.

Another thing I want to mention is the way G1 returns memory to the operating system when it comes to Java heap memory. In the past, this was only done after a full collection, and as you remember, G1 tries to avoid full collections. So basically we never returned Java heap memory, even when it wasn't used. Nowadays G1 can return heap memory after a concurrent mark cycle, and that, together with the fact that we can also schedule periodic concurrent mark cycles, can really help if you have an application that has some kind of idle state and you want it to behave better when it comes to memory footprint. I don't have any slides to show those kinds of improvements, but I think you get the idea.

Instead, I have a slide that shows the improvements to the remembered sets. I tried to use the same benchmark for this, but it doesn't have very many old regions and objects, so I had to use something we call BigRAMTester to be able to show this. This is a benchmark that tries to mimic a kind of in-memory database, keeping a fairly large live set with a lot of references between the objects, which is pretty much the worst case for the G1 remembered sets. It's run with a 16-gigabyte heap, and as you can see, in JDK 8 we used around 4 gigabytes of extra native memory to be able to support a 16-gigabyte heap. That's a 25 percent overhead, so I really understand the people complaining about this. What we managed to do in JDK 11, when we added rebuilding the remembered sets at concurrent mark, was to push it down to around 2.7 to 2.8 gigabytes. Still quite a lot, but the improved ergonomics around the sizing also really helped out.
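As a side note, the periodic concurrent cycles mentioned a moment ago are controlled by a real flag, `-XX:G1PeriodicGCInterval`, added in JDK 12 by JEP 346 (`MyApp` and the interval value are just examples):

```shell
# Trigger a concurrent cycle if no GC has happened for 30 seconds,
# so an idle application's unused heap can be uncommitted and returned to the OS.
java -XX:+UseG1GC -XX:G1PeriodicGCInterval=30000 MyApp
```

Setting the interval to 0, the default, disables the periodic collections. Back to the remembered-set numbers.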
So in JDK 14 we're down to around 1.6 to 1.7 gigabytes. A fun thing to notice is the sawtooth pattern you see here: basically, the memory usage goes up as the concurrent marking cycle ends, and then, while doing the mixed collections, it slowly decreases until no more mixed collections are done, and then it goes back down to the stable state again. BigRAMTester is stressing this quite a lot, so we're having back-to-back concurrent cycles. And it's not only the footprint we improved for this benchmark. As I mentioned earlier, sometimes you have to trade off between latency and footprint and things like that, but in this case we managed to improve both areas. The average pause time in JDK 8 for this benchmark was 1.7 seconds with the default pause goal. In JDK 14 this is down to 360 milliseconds, so that's also quite a significant improvement in pause times. We're still over the target, but it's a really nasty benchmark in some sense, which makes it good for finding problems.

Yeah, that's basically it, so I have a little time for the future. These are the three main investigation areas going forward. First, humongous object handling: humongous objects are, a bit simplified, objects larger than a region in G1. Those can add up to fragmentation, both within regions and between regions, and we want to improve on this. There are ongoing discussions on how to do this most efficiently on the OpenJDK mailing lists, so if you're interested, please subscribe and follow the discussions there, or join in. The same goes for improving the write barriers; discussions are ongoing there as well, and we have some different ideas on how to improve this. The main reason behind this is to try to improve G1's throughput, because right now the G1 write barrier is a bit more expensive than the other ones. But we have plans for improving this as well. And again, footprint reductions: as you saw, we still have something like a 10% overhead to be able to support
that benchmark, and we want to cut that down even more.

Yeah, the key takeaways from this presentation, then: we've done massive improvements to G1 since JDK 8, and if you have an application running G1 on an older version, I really encourage you to try out JDK 11 or JDK 14. I'm sure that's going to give you a performance boost. If you're not running G1, you should move to a later release and try it out, because it might help you out. We also have some really exciting features and ideas in the pipeline, and I'm sure they're all going to help us bring G1 to infinity and beyond. That's all for me. Thanks. We have a few minutes, so any questions?

Hi, what's the lowest pause time target that is practical that you've seen?

Very hard to say; it depends on the application. We never push it much lower ourselves. Well, that's not entirely true either, but as you saw, setting a pause time goal of 50 is okay. If you're setting it to 10, I mean, you should try out ZGC or Shenandoah or something like that, because G1 is focusing on a balance between latency and throughput, so going all the way to really ultra-low latency is not really a goal. But yeah, I think 10 milliseconds should be okay in some cases, depending on how the application looks. You will do a lot of GCs, though. I think the total GC time, when I tuned the pause time goal down, went from 200 seconds, yeah, a total of 200 seconds, to 600 seconds. So you trade away throughput when you do that kind of thing, and you have to keep that in mind.

You mentioned making the write barrier faster. But the write barrier should only be taken if there are inter-region pointers; is there any plan to make there be fewer inter-region pointers?

So you mean increasing the region size, or? It's something we've observed, that we go into the write barrier a lot, and we wouldn't expect that if everything was in the young generation. Oh, okay, yes.
Well, eventually you have to promote objects to the old generation if they live long enough. Or you could have just one generation, but, yeah, there are always trade-offs. For G1, we're not looking into making it single-generation right now, or ever, I would say. But something we have thought about is making the regions larger, and that way having fewer pointers between regions. I'm not sure that would really help you out, but thanks for the question. Anything more? Everything is crystal clear, and everybody will move to JDK 14.