Hi everybody, I'm Christine Flood, and I'm here to talk to you about why we really do need one more garbage collector in OpenJDK. I can't get this up there, so I'm going to hold on to it.

Shenandoah has been going on for almost four years, and a lot of people have contributed to it over that time. I like to give credit where credit is due: all these people have contributed along the way to getting it to the point where it is now, and many of them are up on the stage. This is Roman, Aleksey, and Roland. Don't give me a hard time, okay?

So let's get started. I've never been to FOSDEM before, so I'm giving sort of all of the intro material to work up to where I want to be. What is a garbage collector? A garbage collector is sort of like an omniscient housekeeper. You bought these cookies, and they weren't good, they weren't bad, but you're never going to eat them, and you leave them in your pantry. And you've got this bread that's expired, but you didn't notice, and it's in your pantry too. The garbage collector can go through and see all those things you're never going to eat and get rid of them, and take the things that you do have and organize them for you. So running with managed memory really leaves you in a much better situation.

Okay, so I'm going to jump into the current OpenJDK garbage collectors. I think they all have a place. I know there's sort of a support nightmare with having all of them, but I wanted to say that in certain situations I can see using each of them. For example, if I were running in a containerized environment and had 50 JVMs running on a server, I might want the smallest footprint: the single-threaded garbage collector that has no wastage due to the algorithm, and that would be Serial GC. If I were running something like a weather simulator, and I needed the results fast, and throughput was what it was all about,
I would want to use Parallel GC. If you want to minimize pause times with the current OpenJDK garbage collectors, there's CMS: it does a young generation collection while the world is stopped, but it never does old generation work while the world is stopped. That can give you minimal pause times, but it has some problems: because it's a mark-and-sweep old generation, you can get fragmentation, and your performance can degrade over time. And if you want managed pause times with compaction, you want G1. G1 works very hard to be able to manage your pause times into whatever particular window you need.

But for some folks these aren't enough. They have a heap that's too large and pause time constraints that are too small, and those are the folks we aimed Shenandoah at: people who want to run 100 gigabyte or 200 gigabyte heaps but still have to respond to things in 10 milliseconds. If a request gets stalled because of a GC for a minute, they've violated all kinds of quality of service guarantees.

And I'm going to nip this one in the bud: everybody asks me why this is called Shenandoah. It's because the airport, the PGC airport, which is cool, that's the airport designation, is right outside of Shenandoah National Park. That's all. There's no big secret.
It was just a cool little in-joke, and now you guys are all in on it.

All right. Traditional collectors like the serial collector, Parallel, and ParNew are all separated into generations: you have Eden, you have survivor space, you have old space, and you have a card table over here, which summarizes old-to-young pointers so you can collect just one part of the heap. Shenandoah doesn't use that. Shenandoah is a region-based garbage collector, and so is G1. What we do is break up the heap into regions, and any of these regions can be either old or young, depending on that point in time. We'll start allocating in a region, then we'll allocate another region, and another region; maybe we'll have a big object that's bigger than a single region. Then over time those objects get turned into garbage. We'll pick the garbageiest regions, which are these two, and we'll compact them over there, and then we'll have free regions for you to allocate into. So it's a bit different if what you learned in school was the traditional semi-space collector.
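As a rough illustration, the "pick the garbageiest regions" step can be sketched in plain Java. Everything here (the `Region` record, the region size, the 50% threshold, the method name) is made up for illustration; it is not HotSpot code, just the idea of ranking regions by how much garbage they hold:

```java
import java.util.*;

// A sketch of collection-set selection: rank regions by garbage,
// keep only the ones where evacuation pays off.
public class CollectionSet {
    static final int REGION_SIZE = 1024;          // bytes per region (toy value)
    static final double GARBAGE_THRESHOLD = 0.5;  // only evacuate regions >50% garbage

    record Region(int id, int liveBytes) {
        int garbage() { return REGION_SIZE - liveBytes; }
    }

    // Filter out mostly-live regions, then order by garbage, most garbage first.
    static List<Region> chooseCollectionSet(List<Region> heap) {
        return heap.stream()
                .filter(r -> r.garbage() > REGION_SIZE * GARBAGE_THRESHOLD)
                .sorted(Comparator.comparingInt(Region::garbage).reversed())
                .toList();
    }

    public static void main(String[] args) {
        List<Region> heap = List.of(
                new Region(0, 900),    // mostly live: leave it in place
                new Region(1, 100),    // mostly garbage: evacuate
                new Region(2, 300),    // mostly garbage: evacuate
                new Region(3, 1024));  // fully live
        List<Region> cset = chooseCollectionSet(heap);
        System.out.println(cset.stream().map(Region::id).toList()); // prints [1, 2]
    }
}
```

The live objects in the chosen regions get copied out, and the whole regions then become free space to allocate into.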
It's slightly different, but not that different.

Okay, so stop-the-world compaction. With those garbage collectors, what happens is the Java threads run for a while, they stop, the GC threads do the compaction, and then they stop and the Java threads start up again. Now, Shenandoah is a different beast. You're running Java threads for a while; you stop and do a quick scan of the thread stacks; then you start a concurrent mark, but you start the Java threads again. You do a quick stop and scan of the thread stacks and do a final mark, and then you do concurrent evacuation while the Java threads are running. So we have two very, very short pauses that are basically just as long as it takes to scan your thread stacks.

The concurrent marking part of that is a solved problem. CMS, G1, and Shenandoah all use basically the same snapshot-at-the-beginning algorithm. All that means is that if you update a pointer that used to point to bar so that it now points to baz, and you're in the middle of a concurrent marking cycle, you have to be sure that bar stays alive, because it was alive at the beginning of concurrent marking. So all it is is a barrier that adds bar to the things to be marked.

Concurrent compaction, that's what Shenandoah does; that's new, that's what's cool: we are able to compact the objects while the Java threads are running. So why is concurrent compaction so exciting? It's complicated, right? If you have several Java threads running that are all setting fields in an object, and the GC thread is going to copy the object, you can tell that there are all kinds of bad things that could happen, right?
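Before getting into those copying races, the snapshot-at-the-beginning rule just described boils down to: on every reference store during marking, squirrel away the old value. A plain-Java sketch, where the `Node` class, the flag, and the queue names are all made up for illustration (this is not the HotSpot barrier):

```java
import java.util.*;

// A sketch of the SATB marking barrier: when a reference field is
// overwritten during concurrent marking, the *old* value is pushed onto
// a mark queue so it stays reachable for this marking cycle.
public class SatbBarrier {
    static boolean markingInProgress = true;
    static final Deque<Object> satbQueue = new ArrayDeque<>();

    static class Node { Object field; }

    // Every reference store goes through this barrier while marking runs.
    static void storeField(Node holder, Object newValue) {
        Object previous = holder.field;
        if (markingInProgress && previous != null) {
            satbQueue.push(previous);  // keep "bar" alive: it was live at mark start
        }
        holder.field = newValue;
    }

    public static void main(String[] args) {
        Node n = new Node();
        storeField(n, "bar");                 // old value is null: nothing queued
        storeField(n, "baz");                 // overwrite: "bar" gets queued for marking
        System.out.println(satbQueue.peek()); // prints bar
    }
}
```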
Somebody could write to the object while the GC thread copies it, and we lose the update; or different threads end up writing to different copies. What you really want to have happen is for all these threads to be pointing to foo, and then you flip a switch and have them all pointing to foo prime. That way you don't lose anything and everything is good. But we can't really do that. Even if we had some sort of fancy-schmancy transactional memory, just think of the time it would take to find all the threads that are referencing foo; and by the way, it's not just threads, it could be objects very deep in the heap that are pointing to foo as well, and they would all have to get updated at the same time. We can't really do that.

So the secret sauce in Shenandoah, which isn't really secret sauce, is that we've added an indirection pointer to every object. Now, this is the scariest part about Shenandoah: I was afraid we weren't going to get accepted, because with this pointer, you know, we've increased the size of every object. But the power it gives you is worth it, right?
You have these threads that are pointing to foo; the garbage collector comes along and makes a speculative copy of foo, and it does a CAS (compare-and-swap) to change the indirection pointer to point to foo prime. So anytime anybody accesses this object before the CAS, they get the old copy, and after the CAS they get the new copy. It doesn't matter whether they're in another thread stack or somewhere deep in the heap; they're always going to see the current copy of foo.

What this means is that all object accesses have to go through a read barrier. Now, this was the other big worry about Shenandoah, because the common wisdom back in the Lisp days was that reads are seven times more frequent than writes, so you can't afford a read barrier. But the reality is that our read barrier is extremely cheap: it's one hardware instruction. If you want to read a field of foo, normally you find the location of foo, add the field offset, and read that location. With Shenandoah, you read the location minus eight, resolve that, and then read from there. So without Shenandoah on Intel, a getfield is a single move; with Shenandoah on Intel, we move the contents of the forwarding pointer into the register and then do the read of the value. It really is one extra machine instruction for a read barrier, and as I will show you later on, it is not cost prohibitive.

There's still a race condition, if you're paying attention, because we can have one thread that resolves the forwarding pointer, then stalls and writes to foo, and meanwhile another thread copies foo. And this is a problem.
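The forwarding-pointer mechanics described above (one extra reference per object, resolved on every access and installed with a CAS) can be sketched in plain Java. The `Obj` class and method names here are illustrative, and this models the indirection with an `AtomicReference` field rather than HotSpot's actual word at offset -8:

```java
import java.util.concurrent.atomic.AtomicReference;

// A sketch of the forwarding ("Brooks") pointer scheme: every object
// carries one extra reference, initially pointing to itself.
public class BrooksPointer {
    static class Obj {
        final AtomicReference<Obj> forwardee = new AtomicReference<>();
        int value;
        Obj(int value) { this.value = value; forwardee.set(this); }
    }

    // Read barrier: one extra load to chase the forwarding pointer.
    static Obj resolve(Obj o) { return o.forwardee.get(); }

    // GC side: make a speculative copy, then CAS the forwarding pointer.
    // If the CAS fails, another thread copied first, so use its copy.
    static Obj evacuate(Obj o) {
        Obj copy = new Obj(o.value);
        return o.forwardee.compareAndSet(o, copy) ? copy : o.forwardee.get();
    }

    public static void main(String[] args) {
        Obj foo = new Obj(42);
        Obj before = resolve(foo);                       // still the original
        Obj fooPrime = evacuate(foo);                    // GC copies foo
        Obj after = resolve(foo);                        // now resolves to the copy
        System.out.println(before == foo && after == fooPrime); // prints true
    }
}
```

Before the CAS, every access resolves to foo; after it, every access resolves to foo prime, no matter where the reference lives.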
So we also have to have a more complicated write barrier: a write barrier that copies the object first, before you're allowed to write to it. Here you can see I've got A, B, and C over there; if I want to write to D, I have to first copy it over to to-space and then write to it there. This isn't that painful: there are other concurrent collectors out there that have to copy things on read. We don't copy things on read, we only copy on writes, and writes are less frequent than reads.

And our write barriers are pretty quick. These checks are ordered in this order for a complicated reason I won't go into, but we check the evacuation-in-progress flag, which is thread-local, and we read the forwarding pointer, and if there isn't an evacuation in progress we can just skip to the store. So our write barriers are fairly quick in the case where you aren't actually evacuating objects, and if you are, they just copy the object; they're fairly quick as well.

So what do these barriers cost? Not really as much as you might think. We have phases in C2 (these guys are going to correct me, I'm not a C2 expert) where we can do elimination of barriers on new objects and null pointers; we can do read barrier elimination on final fields; and C2 already does barrier hoisting, so if you have a read barrier in a loop it can sometimes hoist it out of the loop to make it less expensive.

So I did a little experiment. This is from a paper I did last year, where I ran some of the DaCapo benchmarks and gave them an incredibly large heap, large enough that they never actually had to do any GC work. These were the three benchmarks that were able to run in that environment, and you can see that, comparing Shenandoah to G1, the overhead of Shenandoah was between 2.1% and 5.6%. This was a year ago.
We have some numbers now where we're doing even better.

Are there any other gotchas to be concerned with in Shenandoah? You need to think about things like pointer comparisons, right? Because if I have an A here and it's pointing to A prime, and somebody has a pointer to the old object and somebody has a pointer to the new object, you can sometimes get something saying they're not equal when they are. So we compare the pointers directly, and if they're not equal, then we execute the write barrier on both of them, and then we can compare them both in to-space and make sure the result is correct. And there are complications like this for things like CAS as well.

All right, volatiles: everybody was all concerned about volatiles. Volatiles just create a new memory state, and we don't do any optimizations across them, and it's fine.

I'm going to be really quick, because we're going to do a demo afterwards. This is Elasticsearch; I wrote an Elasticsearch benchmark because I wanted to see how we'd do. The only thing that's really interesting here, and this is again from 2016, is that we were slightly slower in terms of run time, but if you look at our total pause time in milliseconds compared to the other collectors, we rock. If you look at our max pause times and our average pause times, we also rock. If your main consideration is response time and you can't afford to pay for pauses, Shenandoah is huge.

All right, I also ran a very famous warehouse benchmark just recently, and, because of some excellent work these guys have been doing lately optimizing our concurrent marking, we are now winning in terms of both max-jOPS and critical-jOPS. So we beat G1 in this particular situation in both throughput and response time, which is a huge win for us in terms of getting better performance.

All right, so we ship in Fedora. If you run Fedora 25, it's just
ShenandoahGC (the flag is -XX:+UseShenandoahGC), and you can try it out in Java 8. We're scheduled to ship in the next release of RHEL; I can't promise that that's going to happen, but that's the current schedule. We spent the last year working on productization and optimizations: we are much more stable, and we're getting much better performance. If you have any questions, or if you want to just send me email, or, I don't know, ask where we got these Shenandoah sweatshirts, go ahead. This is me, and I'm going to turn it over to Aleksey, who's going to do the demo.

So obviously I did run my own benchmarks. Let's see. Come on. Nobody will believe you if you didn't run SPECjbb on your product, right? Okay, let's do this. Come on. Can you see it? Yes, you can.

So, shortly before New Year, Roman and I were having fun and actually wrote a visualizer tool with which we will try to understand how Shenandoah works internally. It's just some eye candy, but it turned out to be much more useful, not only for talks but also to see how the collector actually works. So let's run SPECjbb under this visualizer while Shenandoah is working. What you see here, as Christine said, is a regionalized collector, right? What you see here is actually the Java heap: the squares are regions, and they are color-coded to distinguish whether they were recently allocated, how much live data is there, etc. Right now SPECjbb is ramping up, so it doesn't do any huge allocations at all; it just tries to figure out what's going on. And there is a little graph on the top, which is the time graph of these parameters. Right now nothing happens; we just do the heap work. And then something happens, and we realize that we have to start the concurrent cycle. The thing there on the top, in this yellowish, puke-yellow color, is actually the concurrent mark phase, and then there are the red bars following it.
Those are actually the concurrent evacuations. This is just to give you a feel for the size of these phases in your regular young-GC-style workload, which mostly allocates and doesn't retain much, etc. But what you can also see is that the grayish line, which is the used heap, actually grows through the concurrent mark phase. That's saying that the allocator is still working while the collector marks. If you look closer at that graph, you can actually see that during evacuation that also happens.

But this is not the workload with which we would like to show off the evacuation. For that we have another workload, and this workload is really, really simple. I call it array fragger. You just say: I have an array, and at a random index I will store byte arrays. This workload defies the generational hypothesis, because objects die there regardless of their age. By the way, this is the characteristic that will be endemic for any in-memory least-recently-used cache.

So this is the simple workload we can use, and if we run Shenandoah with it, a few interesting things happen. First of all, it has a much higher allocation rate. And as it goes, you can see clearly that the oldest elements in the array get rewritten, so regions that used to have lots of live objects have less and less live data, up to the point where you have to deal with them somehow. In Shenandoah, that means you identify them as the regions with the most garbage in them, the collection set, evacuate all the live objects from there, and move on. If you have a generational collector, that kind of workload will just fragment your old generation, and you will face either the mixed pauses, like in G1, or the full GC, like in Parallel, etc.
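As I understand it from the description above, the array-fragger workload is essentially the following; the slot count, chunk size, and iteration count are made up for illustration:

```java
import java.util.Random;

// A sketch of the "array fragger" workload: one large array of byte[]
// slots, where each iteration overwrites a random slot with a fresh
// byte array. Every overwrite kills the previous array no matter how
// old it is, which defeats the generational hypothesis.
public class ArrayFragger {
    static byte[][] run(int slots, int chunkBytes, long iterations, long seed) {
        byte[][] array = new byte[slots][];
        Random rnd = new Random(seed);
        for (long i = 0; i < iterations; i++) {
            // The previous byte[] in this slot becomes garbage, regardless of age.
            array[rnd.nextInt(slots)] = new byte[chunkBytes];
        }
        return array;
    }

    public static void main(String[] args) {
        // 10k slots of 1 KB chunks, a million overwrites: a steady allocation
        // rate with a steady death rate of arbitrarily old objects.
        run(10_000, 1024, 1_000_000, 42);
        System.out.println("done");
    }
}
```

Run it long enough and the live data spreads evenly across regions, exactly the fragmentation pattern an LRU cache produces.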
So this workload kind of tells you that we can survive this otherwise heavily fragmenting workload that defies the generational hypothesis. Arguably, most of the data that people now store on the heap is actually caches, so this is your go-to scenario. You can clearly see how long those phases take, but we also have the GC logs, and if you look there you can actually see what the concurrent Shenandoah GC cycle is. So this is cycle number 55. It has the initial mark pause, which takes less than a millisecond, in which we very quickly scan the root set. Then it has a concurrent mark, like any other concurrent GC does, and that takes 160 milliseconds. Then you have a very short final mark phase, 1.5 milliseconds, and those are the only pauses that are ever observed by the mutator. And this is where the Shenandoah beef is: you have the concurrent evacuation, which takes 200 milliseconds there. Any other collector in OpenJDK would have to stop to copy those objects, so that number you are looking at would be your pause time. With Shenandoah it's concurrent, so you don't experience it as much.

So, do we have time for other demos? All right. One of the things about this is that since the most beefy GC pauses are actually concurrent, you can run the collector back to back without much ill effect; the application still runs. For any other collector that would be insane, right? If you run GC back to back, you're always stopping the application for a long time, and it has nothing to do; it gets no work done. For Shenandoah,
we have different heuristics that drive collections, and one of them, which is very useful for testing, is the heuristic that actually runs GC back to back. And you can clearly see that even though the collector is working very hard, with the concurrent mark and concurrent evacuation phases going back to back without stopping to breathe, the allocator is still working; the application is still working even though you're doing back-to-back GCs. This is how a good concurrent collector should behave: in the worst case, when you have so much garbage in the heap that you have to run GC back to back, your application is still able to run.

Okay, and the barriers. Christine talked about the cost of the barriers, so let's run some application. Obviously, if you want to run a throughput application, let's run SPECjvm, right? Let's get the classic thingy there. So we run a single workload, XML validation, with the Parallel collector. If you want to estimate the impact of the barriers, you want a workload that does not retain much, and XML validation does not retain much: it allocates and allocates, but doesn't retain a lot, so the GCs there are actually very, very short. So the performance of this workload roughly correlates with the overhead of the barriers. With the Parallel collector you have only the simple generational write barrier on pointer stores, so it's rather efficient there, and you have the workload running at around 400 ops per minute, right?

So if I run with G1, what would you say the performance would be? What would you expect from G1 on this workload? Who says 400? So, a 4x performance hit? 200? 300? 400? 500? Oh, come on, you're no fun. Okay. With G1, if you run this, you will actually see that, because it has to do more heavyweight write barriers, it actually runs slower. This is the warm-up, so we don't care about that yet, but it will
So we didn't we didn't care about this yet, but it will Sustain the throughput of 320 ops per minute And this tells you something important and fundamental about gcs If your application does not care about pauses, then well engineered throughput collector will beat you every time So only if you care about the post time it it makes sense to run the concurrent collector Regardless of whether it's fully concurrent or partially concurrent, etc At weekend so it's 330 ops per minute. So how about shinandoa? So who is who is saying that shinandoa would be slower than j1? Of course it would be who is for faster Of course, yes So for shinandoa, you have the write barriers for every write not only for the pointer writes and you have the read barriers right, so you can Think that if read barriers are very expensive, then you will see an enormous performance impact, right? But it turns out it's not really that bad So j1 made 330 operations per minute and we are doing 320 I think operations per minute there so the cost of read barriers when your read barriers are really really basic Um in implement implementation wise is really not that high All right, but again if your workload is purely throughput You don't care about post times you can afford larger gc pauses then you shouldn't really bother You should just run with parallel old and be happy about it Okay, yes, you want to talk about the pauses? Yeah Cool Oh, no, not not not as all not as all as all is banned to us questions Yeah So there was a it's about the new pointer you create when you delete the old object and the My question is about every m or a process running about Year or so, how do you think this? Iterations over creating new pointers. 
What would be the impact on the objects? Is this solved, or is this a problem?

Christine: So you're asking what happens with the previous copy of the object?

Audience: You create a new object to solve the problem that threads or other objects could not find the new object, and you create a new pointer, if I understood correctly. Every object has an associated forwarding pointer?

Christine: Exactly.

Audience: So that is one iteration, one garbage collection. If the process goes on for about a year, then after each garbage collection you would create a new pointer, right?

Christine: Okay. The forwarding pointer has the same lifetime as the object. When you have copied an object to to-space and given it a new forwarding pointer, you go through the next phase, where you do the concurrent marking, and you make sure all your pointers now point to the to-space copy; then that previous version can get garbage collected. So there's no history.

Audience: Okay. Yeah, thank you.

Audience: Do you support both C1 and C2?

Christine: Yes, and the interpreter.