Okay, so hi everyone, welcome to today's last talk, about what the GC team has been doing since JDK 9. My name is Thomas Schatzl, I'm from the Oracle HotSpot Java Virtual Machine GC team. So I'm standing here, chained to the desk, and two somewhat awkward thoughts crossed my mind. First this one, and second: we've been hearing a lot of great talks today about the future of Java, what has been done in Java, how to use Java, how to improve it, and now you get a talk that's about the guts of the VM. And while I tried to tone it down a bit, it's going to be a bit technical; well, bad luck. Anyway, I've brought with me five improvements we've been working on since JDK 9, and a call for participation, because we think that the JVM lives and dies by your input, and I'm going to tell you what you can do.

But let's start. First, parallel full GC. There has always been the problem that the G1 full GC is very slow. I mean, it's serial, single-threaded. So G1 has very high worst-case latencies and really bad throughput there. In this change we tried to make the G1 full GC on par with the Parallel GC full GC, and the solution was basically: yeah, make it parallel. And here's one slide with some results. We are comparing the G1 full GC with the Parallel GC full GC on a few applications that I found rummaging through the bug tracker and the mailing lists. They should roughly represent applications with different liveness and different connectivity. There's the SystemGC test, which performs many System.gc() calls on a very small live set. There's this BigRAMTester application, a microbenchmark basically, that is basically a big LRU cache: it allocates a big array of references and adds and removes objects in LRU fashion, so at the point of full GC it has a huge live set and lots of references. And there's a fragmentation-inducing benchmark from Red Hat with a medium live set. And without going into the details: yeah, it looks like we are there. I think 32 or something like that, yeah. Okay, everything fine. This is available since JDK 10 build 33.

Then let's go to the next topic. Sorry, has there been a question? Okay. Faster card scanning. Now you will likely ask: what's card scanning? So let me explain that a bit. In this figure you can see the Java heap, split into regions, the blue boxes; it contains objects, and there are references between them. During GC we are going to move that one object. Now we have a problem: what are we going to do with these references? I mean, if we just continued the application, it would very likely crash at some point when dereferencing them. So there is something called remembered sets which, as the name implies, remember the locations where there are references from other regions into a particular region. Using that, we can fix up these references and the application is ready to go. But what does that have to do with cards and scanning? Well, the elements of the remembered set are not the actual locations of the references to the Java objects, but so-called cards, that means small subdivisions of the memory. And to actually find the references, you need to scan the card's area. And you need to scan this area quickly.
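To make the card idea concrete, here is a minimal, hypothetical Java sketch of the mapping; the names are illustrative and this is not HotSpot's code (the real card table lives in C++ inside the VM), but the 512-byte card size matches HotSpot's default.

```java
// Illustrative sketch, not HotSpot code: how a card table maps an
// address to a card, and why finding a reference means scanning an area.
final class CardTableSketch {
    static final int CARD_SHIFT = 9;              // HotSpot default: 512-byte cards
    static final int CARD_SIZE  = 1 << CARD_SHIFT;

    // A remembered set entry names a card, not an exact field location.
    static long cardIndexFor(long address) {
        return address >>> CARD_SHIFT;
    }

    // To find the actual references into a region, the GC must scan the
    // whole [start, start + CARD_SIZE) area that the card covers.
    static long cardAreaStart(long cardIndex) {
        return cardIndex << CARD_SHIFT;
    }
}
```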
So the solution for this change was to refactor and improve the existing scanning code. That code had been really, really overly generic; it has been replaced by specialized code for the different situations, which allows us to manually subsume and remove lots of checks in the code. And just to show you the results: these are the pause times of this BigRAMTester application, which incidentally also spends tons of time in this card scanning phase, before the change. And yeah, these are the pause times afterwards. Nice thing about this: also available in JDK 10.

Now let's get to what we have been working on lately, and that's not in JDK 10. The first one is what we call rebuilding the remembered sets concurrently. So there's a problem with G1, and that's the remembered sets: they take a lot of memory. We know of instances where 20% of the total heap size is taken just by the remembered sets. If you have a 100 GB heap, the remembered sets take 20 GB, which is bad. The upper bound is actually even higher, because remembered set size is, in the worst case, quadratic in the number of regions. And what has been noticed is that the old regions use the most remembered set memory, and I'm going to try to show you why, by showing you how the remembered set and the objects within a region change over the GC cycle that is shown above that heap snippet.

At the beginning of the collection cycle there are a few so-called young-only GCs, and during that time new entries are added to the remembered set of the region. At some point G1 decides to start marking, because the heap got full enough, and it starts looking at which regions it can evacuate. Okay, now liveness analysis, that is this marking, is working; during that time the application continues working, and more young-only GCs happen. And at some point there's a so-called remark pause, at which time the liveness of the objects within the region has been determined. At that moment G1 could actually immediately start evacuating that region, that is, moving the remaining live contents somewhere else. The problem is the remembered set: it's pretty big at this point, as you might have noticed. So G1 tries to remove obsolete entries, that means entries for cards that do not contain any live objects. For this to happen it needs to do another concurrent phase, called Create Live Data since JDK 9, where it creates the map that is then used to scrub the remembered sets of those obsolete entries. During that time the application continues working, adding remembered set entries, of course. And at some point there's this so-called cleanup pause where, finally, we know the liveness of the objects in that region and have hopefully gotten rid of a lot of remembered set entries, that means cards to scan during the GC that evacuates that region. Unfortunately, for some reason, we can't do the so-called mixed GC immediately; we need to wait for another young-only GC, but then finally we can do it. There are two cases now. One is that the region gets evacuated, and that's the nice one: the remembered set gets completely dropped on the floor, and the region is empty and can be reused. But what happens if that region is not cleaned out, if G1 decides not to evacuate that region, over multiple of those garbage collection cycles? Well, the remembered set gets bigger and bigger. While scrubbing always removes obsolete entries, some fragmentation remains in that memory, and that's why it gets really big.

Some key observations. G1 maintains the remembered sets all the time, for all regions. But if you think about it, and after seeing this animation, you will notice that G1 actually needs the remembered sets, at least for all regions, only during mixed GCs; all the other time it doesn't need them. And the other issue is that removing those obsolete remembered set entries is costly: you need to create this live data map, and during the cleanup pause you need to scrub the remembered sets, which can actually take a few hundred milliseconds that you don't want to spend.
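Before getting to the solution, here's a toy model of what such a per-region remembered set and its scrubbing conceptually do. This is a hypothetical Java sketch for intuition only; HotSpot's real data structure is a far more compact C++ one.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.LongPredicate;

// Toy model, not HotSpot code: a per-region remembered set as a set of
// card indices that may hold references into this region.
final class RegionRememberedSet {
    private final Set<Long> cards = new HashSet<>();

    // The post-write barrier effectively records the card of a location
    // that now holds a reference into this region.
    void addCard(long address) {
        cards.add(address >>> 9);    // 512-byte cards
    }

    // Scrubbing after marking: drop entries for cards whose area no
    // longer contains any live objects. This needs the live data map.
    void scrub(LongPredicate cardHasLiveObjects) {
        cards.removeIf(card -> !cardHasLiveObjects.test(card));
    }

    int cardCount() { return cards.size(); }
}
```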
So the solution here is to only keep the remembered sets when really needed. That means only for regions that are in the so-called collection set candidates, that is, regions that we are probably going to collect in the future, which is much less than all regions. And of course that minimizes the fragmentation in the remembered sets. For that to work we have to reconstruct the remembered sets concurrently, between remark and cleanup, but at least we don't need to do anything in the cleanup pause that would be bad for latency.

There is some prototype, some internal prototype, and there are some side effects. It lengthens the time from remark to cleanup: what we measured was up to 30% longer concurrent marking cycles, which we think isn't that much of an issue, particularly because the adaptive IHOP feature determines when to start marking. That means the scheduling of that initial mark pause will automatically adapt anyway. Other nice side effects are that it improves throughput and pause times: throughput because, outside of the rebuild phase, you don't need to update the remembered sets at all, so all that work goes away; and from a pause time point of view, since you create the remembered sets only at the time when they are needed, there is little fragmentation. Actually, you probably noticed that I didn't say anything about trying to scrub these remembered sets any more. And a feature for the container guys: it allows G1 to implement a bounded remembered set, in terms of memory usage. That means, when rebuilding, G1 could just stop collecting remembered sets for particular regions if some budget has been exhausted, which has always been a problem. I mean, yeah, 20% of the heap is probably too much, even as a kind of safety buffer.

Some numbers on that: memory usage on that BigRAMTester application. In the baseline, the remembered sets take 10% of maximum heap size, which isn't too bad. But in the prototype, outside of the rebuilding and mixed GC phases, the remembered sets only take 0.5% of the total heap any more. At the end of rebuilding, though, you still need 7.5% of the total heap size, when rebuilding the remembered sets for around 60% of the heap. This is due to various reasons: there are some constant costs there; G1 uses a less dense representation of a remembered set if it doesn't contain too many entries; and the selection policy for the regions to rebuild remembered sets for is pretty simple at the moment. So how does this look from a pause time point of view? This is the same graph as before, from the faster card scanning change, and this is what it looks like now, after that change. Yeah. More information can be found in this bug tracker entry. It's a work in progress; it may or may not land in JDK 11.

Now for the next change we're currently working on, about the mixed collections. G1 strives to keep some kind of pause time goal, and it does that by determining the collection set, that means the set of regions it is going to collect, at the start of the garbage collection, and then it just does everything in one go, without caring about the elapsed time while doing so. The problem is that, particularly during mixed collections, predictions are pretty hard to do, and G1 unfortunately mispredicts pretty often.
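Schematically, the current behavior looks something like the following hypothetical sketch: the collection set is fixed up front from per-region cost predictions, and the evacuation then runs to completion no matter how long it really takes. All names are illustrative, not HotSpot's.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Schematic sketch, not HotSpot code: the collection set is chosen up
// front from predicted per-region costs, then evacuated in one go.
final class UpFrontCollectionSet {
    interface Region {
        double predictedEvacuationMs();   // derived from past GC statistics
        void evacuate();
    }

    static void collect(Queue<Region> candidates, double pauseGoalMs) {
        Queue<Region> collectionSet = new ArrayDeque<>();
        double predictedMs = 0.0;
        // Selection happens once, at the start of the pause...
        for (Region r : candidates) {
            if (predictedMs + r.predictedEvacuationMs() > pauseGoalMs) break;
            collectionSet.add(r);
            predictedMs += r.predictedEvacuationMs();
        }
        // ...and then everything is evacuated, even if the predictions
        // turn out wrong and the pause time goal gets exceeded.
        collectionSet.forEach(Region::evacuate);
    }
}
```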
You can tune the mixed collections to keep the pause time goal, but it's pretty hard. Let me show you how this looks. So you have this heap with some young regions, some old regions and some free regions. And there is this collection set, if you want, which contains these young regions and these four old regions at the bottom. Currently, during garbage collection, G1 just takes all these regions and copies the contents over, either into new young regions or old regions. And yeah, maybe it happens that the pause time goal has been exceeded; bad luck. So the solution that we follow here is to incrementally collect the collection set, and abort the evacuation if the next increment would take too long. It's easier to predict the evacuation time for smaller parts of the heap than for larger ones. So how would that look? Same collection set, and G1 starts by collecting the young regions; that takes some part of the pause time budget. Now G1 sees that there is quite a bit of budget left, so it takes two old regions. There is still quite a bit of pause time left, but not as much as before, so it takes just one more old region. Now we are getting really close to the pause time goal, and G1 simply says: yeah, stop the garbage collection, we are probably going to exceed the pause time goal now. There is some performance impact to that abortable mode, so the idea is to only enter this abortable mode if needed, to decrease that overhead. More information can be found in these bug tracker entries. Again, work in progress; it may or may not land in JDK 11.

And on to the last of the changes I want to discuss today, and that's automatic thread sizing. One problem of G1, and basically all of the collectors, is that manually setting the number of threads correctly is impossible, and in many cases not even desired. If you want to run your tiny installer on a machine with 2,000 hardware threads, you don't want that installer to launch 1,500 GC threads. That doesn't make any sense. Anyway, if you wanted to try to set this right number, it's a lot of work, because it depends on the hardware, on the application, and actually on the current application phase you are in. And to make matters even worse, in the HotSpot JVM you can only set the number of threads statically, at the start of the JVM, so basically you can't do it. But there are certain benefits to using the right number of threads for the current situation: it saves resources, threads, memory, it gives somewhat faster startup, and it actually improves performance. Our solution is to let G1 automatically decide this number of threads, because G1 already collects a lot of statistics about GCs: how long it takes to copy a certain amount of objects, how interconnected they are, and such things. So it seems obvious to let the garbage collector decide that. And actually, G1 has been cheating a bit since JDK 9: some phases of the garbage collection already do exactly that, for performance reasons.
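A purely illustrative sketch of the idea, and explicitly not the prototype's actual heuristic: derive the worker count from measured copying statistics and the pause budget, instead of from a number fixed at JVM startup.

```java
// Illustrative only, not the prototype's heuristic: pick the number of
// GC workers so the predicted copying work fits into the pause budget.
final class GCWorkerErgonomics {
    static int workersFor(long bytesToCopy,
                          double measuredBytesPerMsPerWorker, // from past GCs
                          double pauseBudgetMs,
                          int maxWorkers) {
        // Work one thread can do within the budget, based on measurements.
        double perWorkerBudget = measuredBytesPerMsPerWorker * pauseBudgetMs;
        int needed = (int) Math.ceil(bytesToCopy / perWorkerBudget);
        // Clamp: at least one worker, never more than the configured maximum.
        return Math.max(1, Math.min(maxWorkers, needed));
    }
}
```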
I brought a graph of some random application here, which at the beginning has some longer pauses because it starts up, moves stuff around, initializes some kind of database, if you want to say so; after that, there's not much activity going on. If you look at the right side of the graph, it shows the number of threads used during the evacuation phase. The light blue line basically shows the current state of the VM: we always use 28 threads, whatever the situation. But in our prototype, basically, at the start, when there's a lot of activity, G1 ramps up the number of threads really aggressively, but then drops them down, to three in this case. And looking back at the left graph again, which shows pause times: well, it actually even improves pause times a little, which is nice; yeah, we'll take it. More information can be found in this draft JEP or in that bug tracker entry. Again, that's work in progress and will land in some release in the future.

So let's get to the last topic I have here: participate. We would like to have you participate because, well, it helps us a lot to improve G1 and also the other collectors, because we want to know what the right thing to improve is. I mean, I mentioned and used this BigRAMTester application: that's an application that was attached by some user after noticing that his application didn't work well at all, and yeah, we will look at this stuff and fix it. We know that not everybody can contribute on the same level, but there are a lot of things you can do. Probably the easiest and least time-consuming thing is hanging out on the hotspot-gc-use mailing list: report your failures, report your successes, which is also pretty nice, and provide answers to the community. If you think you want to start with development, fixing small bugs: in the bug tracker, at least in the GC team, we label all our bugs that are simple with labels like "starter" or "cleanup". Browse through them, look through them, and yeah, come to hotspot-gc-dev and discuss them. Even if you don't have a fix, we will help you to help.

But I also brought some interesting larger projects for the pros among you, which would be nice for G1 of course, and they are called nmethod entry barriers, throughput barriers, and NUMA support for G1. Just a few quick notes about each of them. Nmethod entry barriers: there is a functionality in the compiler called nmethod entry barriers, basically some small piece of code that is run before an nmethod, a compiled method, is entered, and that could be used to improve the throughput of the G1 collector. In particular, G1 uses so-called write barriers for every reference write, and part of that write barrier isn't used all the time; actually it's used only during the very, very small marking part of the garbage collection cycle you saw before. But at the moment it's active all the time, there needs to be some code executed for it, and that, as I learned yesterday, could cost up to 3% of throughput. So what such nmethod entry barriers could do is, outside the time when it's needed, remove that so-called pre-barrier and exchange it for nops, that means no-operations; basically, every CPU can do nothing pretty fast. Throughput barriers: so G1 has throughput deficiencies, everyone knows them, mostly write barrier related, and this time it's the so-called post-write barrier. There is actually a possibility to be as fast during application time, or almost as fast, as Parallel GC, and that would be by using the same barriers as Parallel GC in G1, at the cost of some little extra work during the pause, which may or may not actually have an impact. For more information: my colleague Eric gave a really, really good talk about this at last FOSDEM.
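For intuition, here is a hypothetical pseudo-Java sketch of the two barrier halves just discussed. The real barriers are machine code emitted by the JIT compilers around every reference write; all names here are illustrative, not HotSpot's.

```java
// Hypothetical pseudo-Java sketch, not HotSpot code: the shape of G1's
// pre-write (SATB) and post-write barriers around a reference store.
final class G1BarrierSketch {
    static volatile boolean markingActive = false; // true only while marking runs

    static final class Node { Node next; }

    static void writeNext(Node holder, Node newValue) {
        // Pre-write (SATB) barrier: record the value being overwritten so
        // concurrent marking can't lose it. Only useful while marking is
        // active, yet the check runs on every write today; nmethod entry
        // barriers could patch it to nops outside of marking.
        if (markingActive && holder.next != null) {
            enqueueSatb(holder.next);
        }
        holder.next = newValue;
        // Post-write barrier: track the new cross-region reference for the
        // remembered sets by dirtying a card (the throughput-critical half).
        if (newValue != null) {
            dirtyCardFor(holder);
        }
    }

    static void enqueueSatb(Node overwritten) { /* push onto a SATB queue */ }
    static void dirtyCardFor(Node holder)     { /* mark the holder's card dirty */ }
}
```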
And now let's get to the last item I have brought with me, and that's NUMA support. Yeah, it would be really, really great if we could improve throughput on large multi-socket machines by exploiting memory locality in G1. There is a lot of opportunity there to get really, really good numbers. There's an enhancement request in the bug tracker that has actually been open for six years now; maybe somebody wants to have a try. And that's it from me already. Questions?