Hi everybody. I'm Christine Flood. This is an exciting time to be a GC weenie: for the first time in a decade we have two new collectors happening in OpenJDK, plus G1 is getting better and better all the time. So I wanted to frame a conversation around how we actually compare these collectors and how you tell which one is better. We really want a technical discussion about collectors rather than a political discussion, and that's what I'm hoping to do with this talk. If you're a real GC weenie it's going to be obvious, and if you don't know anything about garbage collection it's probably not going to be that interesting, but if you're somewhere in the middle, hopefully this will get us all on the same page about what's important. Aleksey gave me a lot of feedback on this talk, so I put his name on the slide, but if there's anything wrong, the mistake is mine. Thank you very much for that vote of support.

Okay, so what do we really care about when we're measuring GC performance? People show pause times, people show throughput, people show footprint. Footprint is actually getting more and more important, and it's one of those things that up until now has been pretty much hidden in the JVM. You have a card table; nobody talks about it. The container folks all say, "I have a 200 megabyte heap, how come my process is using 400 megabytes?" That's because there are other data structures inside, and we need to talk about those now. So these are the three metrics I'd like us to use when we talk about different GC algorithms: throughput, meaning how fast do I get my answer, the end-to-end run time; footprint, meaning how much memory the program uses, all of it, measured externally, because I don't want the program reporting on itself, I want it from external tools; and responsiveness, which I also want from external tools.
I don't know if you know this, but sometimes your GC logs tell you the amount of time you actually spent in GC, and not the time it took to stop all the threads, run the GC, and start all the threads again. So I really want an external measurement of what the pauses are, and I want a metric for that, which we'll talk about in a little while.

So how do you measure throughput? This is one of those things that's really hard. Back in the day, and I've been doing this off and on for a long time, we used to run ten times, cross out the highest, cross out the lowest, and average the rest. Wall clock time is hard, and I'll talk about that a little bit more. But now, thanks to Aleksey, we have JMH, and that does a lot of this for you.

For wall clock time I just use time to measure it. This looks really boring, but there are interesting things in it. I picked a program that I'm calling Scooby that is really hard for the garbage collector. I know a lot about garbage collection; I know how to write a program that will make the garbage collector miserable, and that's what this is. Here you can see that when you run parallel and serial, parallel does take a millisecond and a half less, but look at how much time the user spent: like a minute and nineteen seconds of CPU. If you're trying to measure CPU cycles, that's something to keep in mind when you compare two garbage collectors. And wall clock time is a bad measurement; it's not repeatable. I just did two runs of the same thing and there's a difference of three milliseconds or thereabouts. So don't use it. Unless you're running something really big like SPECjbb, don't use wall clock time. Use the microbenchmark harness.
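To make those two approaches concrete, the old trim-and-average trick and the kind of confidence interval JMH reports, here's a sketch in Python. The run times are invented, and the interval is only a normal-approximation stand-in for what JMH actually computes:

```python
import math
import statistics

def trimmed_mean(runs):
    """The old-school method: drop the fastest and slowest run, average the rest."""
    if len(runs) < 3:
        raise ValueError("need at least 3 runs")
    kept = sorted(runs)[1:-1]
    return statistics.mean(kept)

def confidence_interval(samples, z=3.29):
    """Roughly a JMH-style interval: mean +/- z * standard error.
    z=3.29 corresponds to ~99.9% under a normal approximation
    (JMH itself uses a Student's t distribution)."""
    mean = statistics.mean(samples)
    stderr = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - z * stderr, mean + z * stderr

# Ten hypothetical wall-clock times, in seconds, for the same benchmark:
runs = [12.1, 11.9, 12.4, 12.0, 12.2, 11.8, 12.3, 12.1, 12.0, 13.5]
print(trimmed_mean(runs))           # average with best and worst discarded
lo, hi = confidence_interval(runs)
print(lo, hi)                       # "with ~99.9% certainty, between lo and hi"
```

The point of the interval is exactly what the talk says: two runs that differ by a few milliseconds tell you nothing, but a bounded interval is something you can actually compare across collectors.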
Here I just ran, again, programs I'm naming silly things, but if you use JMH you get a much better metric. You can say with 99.9% certainty that it's going to be between 254 and 294 ops per minute. You get a confidence interval, and that's something you really can use to compare two garbage collectors. It's repeatable and reasonable to use as a metric.

Now, this talk is a little schizophrenic; bear with me. It's partially about what you want to do to measure your garbage collectors, but it's also about what to do if your garbage collector isn't performing, because that's what I know how to do, and I want to make sure we're all talking about the same thing. So, I use perf, and it tells you instruction counts and cycle counts. They can be different, and that can be important. It can tell you cache misses. Way back in the day, cache misses were the thing; for parallel GC it was finding all the global variables and getting rid of them. TLB misses are an important metric too. Basically, these are all metrics you can look at to see how well your garbage collector is performing, and it does differ: you can get different cache behavior depending on whether you do breadth-first or depth-first search of your tree when you copy objects. You can rearrange your objects to get better cache performance. So I think these are all metrics of how well your garbage collector is behaving.

As I said, this is sort of schizophrenic, but I wanted to show you exactly what I do when I want an instruction count. I don't know if you use perf, but you can also inject the JITted symbols into perf, so you can see exactly what's going on when you run it, and you can get a report that shows where you're spending your time. For this particular torturous example, we're spending almost 90% of our time in the JVM, and it's all in garbage collection.
You can see here, and I'm picking on parallel GC because it's one of the older garbage collectors and nobody is going to throw tomatoes at me, that you're spending 88% of your time in garbage collection. This is an important metric to collect so you can see how well your garbage collector is performing. If you had a really good garbage collector, the JITted code would be a much higher percentage of what was going on. Or if you had a program that was less perverse; but this is just to show you what you can see. You can also run perf stat to get a high-level overview, just to give you an idea of how well the garbage collector is working.

Measuring footprint: this is important. How much memory are you actually using? I gave you two tools here because they fill different niches in my mind. pmap will give you a static view of what's there, but I prefer top because it can show you, over time, how much memory you're using. And the interesting thing is that while watching top you can actually see that some garbage collectors give memory back when they're not using it. If you've provisioned your heap to be between 10 megabytes and 2 gigabytes and you get into a steady state, does the garbage collector give the memory back to the OS? This is really important when you're running in a container and your process has quiesced and doesn't need all that much space anymore. So this is one of those metrics you might want to think about, if it matters to you: what is the memory behavior over time? And what tools do we have to look at our memory footprint and figure out what's going on? Native memory tracking.
Again, I've got the command lines here mostly so that if you're sitting there you can copy and paste them. But it's interesting to look at where native memory tracking says your memory is going. So here's parallel GC and serial GC. Their heaps are the same size, but you can see that parallel GC has more threads, so it uses more thread memory, which makes sense. I didn't get it at first, but when I looked at it I said, okay, that makes sense. And the GC-internal data structures are different sizes, so parallel GC has more data structures. Let me make sure that's right... they're about the same in committed... oh, no, no: 105 megabytes versus 6 megabytes in GC data structures. So basically you want to look at how much extra memory your GC is using. It's not really fair to blame Shenandoah for the extra forwarding pointer we keep in the heap unless you also account for the card tables and all the other external data structures that the other garbage collectors are using. So you really want to go to resident set size to compare them apples to apples.

If you're trying to diagnose your garbage collector: jhat. I know this tool is no longer supported (don't worry, I'm going to be done way ahead of time), but it's really, really useful. You can look at the live data size of your heap and see how close your garbage collector stays to that live data size. How much floating garbage is there? How much slop? And yes, you can have a memory leak in Java: if you are somehow holding on to things you didn't expect, jhat will show you that, and this just gives you an idea of how that works. Then GC logs: these tell you how your memory is doing, but it's just the heap. And if you look at that red one, you did a young generation GC where you spent 62 milliseconds copying heap and got nothing back.
That's probably not a good fit between your algorithm and your application: the algorithm is trying really hard to collect young stuff, and this particular application has no young stuff. So you can look at the GC logs and see how effective the collections are being.

Now, this is the one we really want to talk about. With low-latency GCs coming down the pike, what is the responsiveness? And the tools are not as good as we would hope: we have GC logs and we have jHiccup. I'm going to get to what we really want soon, but this is what we currently have. You can look at the logs to see what the pause times are, and these pause times are pretty high, so you know this isn't going to be a very responsive collector. But we want a more fine-grained way to talk about pause times. You can use jHiccup, from Azul, which gives you a histogram of pause times and tells you where you are. That's better than nothing, but it's not what I want. Here's jHiccup on my awful program, showing that 50% of the pauses are way too big; for this particular instance it's telling you this is not a good match. And I ran a regular program too, just to show that I'm not only picking on people: for a regular G1 program with sort of generational behavior, you can see that you get like 95% of your pause times under 10 milliseconds with G1, which is really good. But we think that with the low-latency collectors we can do even better.

So, the metrics I wish I had. Some of these we're actively working on, and some I just want to put on the table for discussion. I would really like a tool that gave us total memory over time. I showed you what top does, but I want something that just gives you a graph of memory usage. I work for Red Hat, so I went and talked to the Linux weenies: I want this tool. And if anybody knows of a tool that does this, that would be great.
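In the meantime, a crude version of that memory-over-time graph can be scripted by sampling a process's resident set size. This is only a sketch of the idea: it assumes the Linux /proc/&lt;pid&gt;/status format, and a real tool would do this with far less overhead and better resolution:

```python
import time

def parse_vmrss_kb(status_text):
    """Pull the resident set size (VmRSS, in kB) out of /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like: "VmRSS:      204800 kB"
            return int(line.split()[1])
    return None  # process may be swapped out or the field absent

def sample_rss(pid, interval_s=1.0, samples=60):
    """Sample RSS over time; the resulting (timestamp, kB) pairs can be plotted."""
    points = []
    for _ in range(samples):
        with open(f"/proc/{pid}/status") as f:
            points.append((time.time(), parse_vmrss_kb(f.read())))
        time.sleep(interval_s)
    return points
```

Feed those points to any plotting tool and you have the memory-over-time graph; a collector that hands memory back to the OS would show up as the RSS series dropping once the process quiesces.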
I just want a tool that plots a picture of memory usage over time. Wouldn't that be swell? I'd also like something for pauses: when a pause gets reported, you learn that it ended at 250 milliseconds and it took 50 milliseconds. I'd really like some way of graphing that, with the start time and the end time, so I could create graphs like this one, where the black lines are your Java threads and the red lines are your GC threads. I find this a lot easier to look at to figure out what's going on. With parallel GC it looks kind of like this: your Java threads are running, then you have the whole machine doing GC, then your Java threads are running again, then a huge GC. When you have a graph like this, it gives you a much better idea of what's going on in the garbage collector. And I did one for G1. These are real runs, by the way: I actually took the data, munged it, put it into LaTeX, and drew it. For G1 you can see that when we start, we're doing really well. The Java threads run and the GC threads run, the Java threads run and the GC threads run, and that looks really cool. And then we fall off a cliff into a full GC. At the end of the full GC it says, oh, maybe I should do a concurrent mark, and that's what the blue threads are; you can see those. I would really like to be able to display this kind of data so you can visualize what your garbage collector is doing.

Now we're down to the metric we really want. It's called minimum mutator utilization. Well, watch: this is not quite what we really want, we really want the next one, but we'll start with this one. Basically: out of every 10 milliseconds, how many milliseconds did the Java threads get, and how many milliseconds did the garbage collector threads get? When you use this kind of metric, you actually have something where you can compare apples to apples.
What you're going to get out of it is the answer to: am I going to be able to guarantee a response time of 10 milliseconds? Is the Java program getting enough of the time? And this isn't mine; it's from Cheng and Blelloch, from a paper back in 2001. It was later improved on, because MMU isn't good enough: the size of your window can get bigger while your Java utilization gets smaller. So this is what we really want: bounded mutator utilization, BMU. This is the metric, and we want to get it externally. We want a Linux tool that runs against your Java program, measures what's going on, and gives us bounded mutator utilization. This is the number we want to use if we're comparing low-latency garbage collectors. So I wanted to frame the conversation, and that's my answer.

There are other metrics that have been used in the GC literature, so I'm going to put them up there. I don't think they're as important, but they're still interesting. Bytes copied per bytes allocated: this is an interesting measure of how hard the GC is working. And the funny thing is, everybody poo-poos CMS, but CMS wins on this one, because the old-generation stuff doesn't get copied. In terms of bytes copied per bytes allocated it does really well, until it hits the fragmentation problem and falls off. But if you're running in a container and you can just shoot the process and start it up again, you don't care about fragmentation. So there are cases where this is an interesting metric. What's that? Epsilon does even better, yes, Epsilon does even better. It might fall off the cliff sooner, though.

I'd also love to see live data graphs. jhat gives you this, but it would require running for a long time. If we could give you the live data graph of your program over time, next to the size of the heap that's actually occupied, that would be a great way to compare two garbage collectors.
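Coming back to mutator utilization: to make the definitions concrete, here's a sketch that computes MMU and BMU from a list of stop-the-world pause intervals, following the Cheng and Blelloch definitions. The pause data is hypothetical, and a real tool would collect the intervals externally; the brute-force window sweep is for clarity, not efficiency:

```python
def total_pause_in(pauses, t0, t1):
    """Total stop-the-world time overlapping the window [t0, t1]."""
    return sum(max(0.0, min(e, t1) - max(s, t0)) for s, e in pauses)

def mmu(pauses, window, horizon, step=0.5):
    """Minimum mutator utilization: the worst-case fraction of any
    window-sized slice of [0, horizon] that the Java threads got."""
    worst = 1.0
    t = 0.0
    while t + window <= horizon:
        worst = min(worst, 1.0 - total_pause_in(pauses, t, t + window) / window)
        t += step
    return worst

def bmu(pauses, window, horizon, step=0.5):
    """Bounded mutator utilization: the worst utilization over every window
    of size >= `window`, which makes the curve monotone in window size."""
    worst = 1.0
    w = window
    while w <= horizon:
        worst = min(worst, mmu(pauses, w, horizon, step))
        w += step
    return worst

# Hypothetical pause log: (start_ms, end_ms) stop-the-world intervals.
pauses = [(5.0, 8.0), (20.0, 21.0), (40.0, 46.0)]
print(mmu(pauses, 10.0, 60.0))  # out of every 10 ms, the worst share the mutator got
print(bmu(pauses, 10.0, 60.0))  # same, but over all windows of 10 ms or more
```

With these invented numbers, the 6 ms pause dominates: in the worst 10 ms window the mutator only gets 4 ms, so a 10 ms response-time guarantee is not going to hold.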
Objects visited per bytes reclaimed. One of the problems with concurrent garbage collectors that mark the entire heap is that a lot of the time there's this little bit over here that's changing, and 200 gigabytes of tenured stuff that you end up walking every GC cycle. So I think that's a pretty interesting metric: can we get that down? Can we somehow summarize the old generation, not with remembered sets like G1, but with something very small and concise, to get this metric down? And fragmentation costs. Different collectors have different ways of handling how they allocate their storage and lay out their regions. CMS gets fragmented, and you can end up with a Swiss-cheese old generation where you can't put anything. But you can also have TLAB allocations with space left at the bottom, and region allocations with space left at the bottom. The question is: how many of these holes are there in your heap, can we keep track of them and report them, and which garbage collector does better at not leaving a lot of small holes?

And I'm done, and I'm early. I told you I would be early. So I do have time: if anybody has any questions for me, I'm happy to answer them. And I have probably a minute or two. Yes. Do you trust your GC logs? I would like to make that better. Like I said, it's an exciting time for garbage collection again; people care. So we're at a point where we can start on all of those metrics and making them available. I would like to do that. I can't commit to it right now because I have a lot of things on my plate, but I would love to have a BMU metric from an external Linux tool that runs against Java, and I would love to have a graph of what's going on, so you can see it visually rather than having to parse GC logs. If anybody in the community wants to do this, it's all there, and I will cheer you on, and you can have my firstborn son; he's a pain in the butt.
Yes, I would be very excited. Anything else? There is no way to fix that. It has to do with what's going on in your machine, and sometimes inside the JVM there's weird magic that I can't explain. I've seen Cliff Click do the same talk and he couldn't explain it either. So wall clock time is not reliable; you really want JMH if you're running a microbenchmark. And if you're running something really big, for maybe two hours, then it all gets averaged out. Any other questions? Yes, you're absolutely right, and I don't have any insight. I wish I did. Part of the problem for me is that we have write barriers: in Shenandoah, if a Java thread hits an old-generation object that's about to be copied, it has to copy that object itself. So that has to somehow be accounted for in garbage collection time, and I don't know how that's going to work. And it's hard because that object copy is so fast. All right, I guess I'm done unless there's anything else. Try all the garbage collectors. Really.