Okay. Hey, I'm Chris. I switched companies; I work for Twitter now, and I'm still doing the same kind of stuff as I did at Oracle, so I'm still working on compilers. I'm actually surprised there are so many people here today. Usually the room is empty when I talk about compilers. So we're trying to grow that Graal, let's say, experience at Twitter in a production environment. That means we are not really running it in production, but we tried it, and I'll show a bunch of graphs later for how that works out. I know all of you have Twitter accounts, so give a little love to FOSDEM, and maybe to our Twitter VM team too.

A little bit about Twitter. You all know Twitter as a service; this is what it looks like. It's a huge distributed system. We have many, many services: the main tweet service, which reads and writes tweets, and then the user service, timeline service, social graph service, and so on. Those are the main ones. And each service is many JVMs, right? We run tons and tons of them: thousands of machines running thousands of JVMs, across multiple data centers. So there's a lot going on, and if we can cut down on CPU time or memory usage, it pays off a lot.

Twitter does open source. We love open source: we use many projects and we give back to the community. This is the link where you can find all the projects; I think all of it is actually on GitHub, but I might be wrong. So Graal is a good fit for us in general, also because it is open source, and if we find something we can optimize, we definitely plan to contribute it back. Internally, Twitter has its own JDK. It's based on OpenJDK right now, an 8 update release, and we're in the process of open sourcing our stuff. Over the last couple of years we've done some GC optimizations, but from what I've heard, almost all of it is already upstream.
So I did this: I backported JVMCI to our own 8 JDK so that we can actually run Graal. We have something called Contrail, which is basically a JFR replacement; I'm not talking about that here. And CMS improvements, but as I said, from what I've heard it's all upstream now.

So why Graal? I think I talk about this every time I'm here. I've worked on C2 for a very long time, and it's very complicated and it's not really getting better. There's work being done to clean it up a little, but it's still the same old complex code. The learning curve is way, way too steep. We've noticed over the years, when we hired new people, that they need to learn for years and years until they can actually work on C2. In my opinion there were no major new optimizations in C2 in the last couple of years. Most of the work done on C2 is intrinsics; there was no new escape analysis implementation, and no real improvement to inlining and all that. There was a little bit, but that's another topic. In my opinion, C2 reached its end of life a long time ago.

Graal's learning curve is much shorter. For people who have looked at the code, yes, we talked about it, but compared to C2 I think it's shorter. You have the advantages of Java compared to the C++ that HotSpot is using: you can use all the standard utility classes. And it's highly modularized. This is an old snapshot, and I don't know if it's still accurate, but it has something like 83 different modules, with no circular dependencies between them; the build system takes care of that and makes sure you don't introduce circular dependencies. So you have platform-independent modules, and then platform-dependent modules that implement the parts specific to your CPU architecture, and so on. Okay, so we ran Graal at Twitter and we found a few bugs. Actually, not that many: we basically only found two.
I'm showing three bugs here, but really just two. Do you have an idea of the things we noticed? This first one, I think I noticed even before I started running things at Twitter: Graal does not support certain on-stack-replacement (OSR) compilations, which, it turns out, could be an issue. When you turn on print compilation it looks something like this: it says it cannot do an OSR compilation with locks. Tom Rodriguez and I discussed this a little. The way Graal is currently set up in the tiered environment, it's not really a problem, because what happens is you get this message a couple of times, I can't remember the number, a hundred or a thousand, and then the tiered compilation system says, okay, I'm not compiling this at tier 4 anymore, and it just compiles it at tier 1 with C1, and that's basically fine. It could be an issue for a very performance-sensitive method, but I've not found a case where that's a problem. Bug 128 is still open, so if someone wants to go and fix it, please do; I don't have time right now.

This next one is a real bug that I found while running, I think, the Tweet service. The thing to notice is this line here. It had to run for a couple of days until it crashed, but it crashed consistently. I was trying to figure out what it was, and eventually I found that it's this thing called Heapster that we use to analyze the heap. It's basically a bytecode instrumentation tool; here's the GitHub page. It provides an agent library to do heap profiling and so on, and we use that. I'm not going into the details because we don't have time, but this is a snapshot of the discussion we had on the GitHub page. It had to do with intrinsics, basically.
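To make the OSR-with-locks situation concrete, here is a hypothetical sketch of the code shape involved (class and method names are mine, not from the talk): a hot loop running entirely inside a synchronized region. An OSR compilation requested at the loop back-edge would have to enter compiled code while the monitor is held, which is the case Graal refused to compile, so the tiered system eventually fell back to C1.

```java
// Hypothetical shape that can trigger "can't do OSR with locks":
// the loop back-edges request on-stack replacement while the
// thread still holds the monitor entered by synchronized().
public class OsrWithLocks {
    private static final Object LOCK = new Object();

    static long sumUnderLock(int n) {
        long sum = 0;
        synchronized (LOCK) {               // monitor held across the hot loop
            for (int i = 0; i < n; i++) {   // back-edge counter overflows -> OSR request
                sum += i;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumUnderLock(1_000_000));
    }
}
```

With `-XX:+PrintCompilation` on a Graal-enabled JVM of that era, a method like this would produce the repeated "not compiling OSR with locks" messages described above.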
The intrinsic that failed was Double.valueOf, because it's part of the core library and is used by Graal itself, but then it got instrumented, because it allocates a new object and so on. That was basically the issue. This one was closed; it was a big change, as you can see, 65 files had to be touched, but now it works.

This one was annoying too. I renamed it later when I figured out what it actually was, but this is what happened: we saw these weird exceptions flying by, tons and tons of IOExceptions, and I couldn't figure out why. Then one day I decided to run the Netty 4 tests. Someone told me that some of our services had just upgraded from Netty 3 to Netty 4, so I thought, okay, I'll run the tests. And this is what happened: a Netty buffer failure. This one was really, really awkward, because reverseBytes didn't work. And you would think that if that doesn't work, something else would break, right? But it didn't; it never did. It was basically wrong from day one, but it only showed up in this particular case. So Tom fixed it; it was a small fix. And these were the only two real bugs we found. Everything else just worked fine.

Now I'm coming to what I think is the more important and interesting part: a couple of performance graphs. I used the Tweet service because it's our main service. It's a Finagle Thrift service; you can download Finagle from that page and play around with it. I have dedicated machines for this testing, and all of the instances receive the exact same requests. So it's not just that the number of requests is the same; it's the exact same requests. And it's read-only, by the way, because this is a staging environment: we don't really write tweets, we read tweets, and we read the same tweets. I ran this with GraalVM 0.17 on our JVM. This slide is only to show you what the load looks like. All of the graphs are 24 hours, one day.
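The reverseBytes bug is easy to state in code. Integer.reverseBytes is intrinsified by the JIT, and the Graal intrinsic produced wrong results, which only surfaced through Netty's endianness conversions as buffer failures. A minimal sketch of the Java-level semantics the intrinsic must match:

```java
// Integer.reverseBytes reverses the byte order of an int:
// 0x01020304 becomes 0x04030201. JIT compilers replace the call
// with a single byte-swap instruction, so a broken intrinsic
// silently corrupts every endianness conversion built on it.
public class ReverseBytesDemo {
    static int reversed(int v) {
        return Integer.reverseBytes(v);
    }

    public static void main(String[] args) {
        System.out.printf("%08x%n", reversed(0x01020304)); // prints 04030201
    }
}
```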
There are two colors in here: blue is C2, and orange is Graal. You can see it gets the same requests throughout the day. I picked this particular snapshot because you have pretty high load, but also this plateau down here with low load. I don't know why that happened, and it doesn't really matter, but it's a good example for seeing the difference between high and low load. Most of the graphs I'm showing use a moving average, because that makes it easier to see what's happening. This one uses a moving average of 60 minutes.

These are the scavenge cycles. I should mention this: I cannot show you the y-axis, because I'm not allowed to; it's confidential information, so I have to just tell you what the percentages are. What we're seeing here is between one and two percent fewer scavenge cycles with Graal. That's mainly because of the better escape analysis: you produce less garbage, and so you have one to two percent fewer scavenge cycles.

This is the scavenge time, now with a moving average of 10 minutes; I think all the following slides are also 10-minute moving averages. There are two things to say here. We have fewer scavenge cycles because there's less garbage in the young generation, so it takes longer to fill up. But when it does fill up and hits the threshold that kicks off a GC, there's less garbage in it and more live objects, and that means the collection takes longer. I think it's somewhere over here: a maximum of 30 percent more scavenge time, which is quite a lot. Usually it's between 10 and 20 percent; still a lot compared to only one to two percent fewer cycles.

This is the old gen. We have one, two, three, four cycles in a day. The interesting part is this: the old gen occupancy when you run Graal is higher.
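The escape-analysis effect behind those garbage numbers can be sketched in a few lines (the class and method are illustrative, not from the talk): when a temporary object never escapes the method that allocates it, the compiler can scalar-replace it, i.e. keep its fields in registers instead of allocating on the heap. Graal's partial escape analysis catches more of these than C2, which is why the young generation fills more slowly and there are fewer scavenge cycles.

```java
// Minimal sketch of an allocation that escape analysis removes.
// Neither Point escapes distance(), so an optimizing compiler can
// scalar-replace both: no heap allocation, no garbage produced.
public class EscapeAnalysisDemo {
    static final class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    static double distance(double x1, double y1, double x2, double y2) {
        Point a = new Point(x1, y1);   // candidate for scalar replacement
        Point b = new Point(x2, y2);   // never stored to a field or returned
        double dx = a.x - b.x, dy = a.y - b.y;
        return Math.sqrt(dx * dx + dy * dy);
    }

    public static void main(String[] args) {
        System.out.println(distance(0, 0, 3, 4)); // 5.0
    }
}
```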
That's because Graal is a Java program; it has state, and this is Graal's state. It's between, I think that's 10 megabytes over here, and 60 megabytes. You have to account for that. There are multiple ways to deal with it; well, right now there's no real way to deal with it, but you can just increase your old gen size by 60 megabytes and you're basically back to where you were. It's important to know that this is happening. No, the old generation in the Java heap.

These are the P99 latencies for tweet reads. Again, I think, a 10-minute moving average. There are some odd spikes here in the C2 line; I don't know why, and it doesn't really matter. You cannot really see a lot here. The graph down here is basically an integral over the graph up here, and the difference is hard to see, but it's there: Graal is slightly higher, 1 percent more. So you have 1 percent worse P99 latencies, and this is very likely because of the higher scavenge times. It's easier to see with the P999 latencies, and the difference there is, what is it, 2.5 percent more. That's what you're paying.

This is the request CPU time for a tweet. For this one I could actually show the y-axis. It's split out: this is user CPU time, and this is system CPU time. I split it out because it's very interesting down here. You have slightly better, what are my notes, up to 4 percent better user CPU time. I think this is because you have fewer scavenge cycles, so you use less CPU time, and also, I think, because Graal produces slightly tighter code. The interesting part down here is, how much is it, 4 percent worse system CPU time under load, which I cannot explain. I don't know why; I have no idea. But since the overall system CPU time is only between 15 and 20 percent, and you have 4 percent more of that, it doesn't really matter. This next one is a very interesting graph. How am I doing with time?
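Since Graal's state lives on the Java heap, you can watch it show up in ordinary heap-usage numbers. A small sketch of observing heap occupancy from inside the JVM, which is one way to quantify the 10 to 60 megabytes you would compensate for by growing the old gen:

```java
// Observe current Java heap usage via the standard management API.
// On a Graal-enabled JVM, part of this usage is the compiler's own
// state (graphs, caches), which C2 would keep in native memory instead.
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapOccupancy {
    static long usedHeapBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    public static void main(String[] args) {
        System.out.printf("used heap: %d MB%n", usedHeapBytes() / (1024 * 1024));
    }
}
```

Comparing this number between a C2 run and a Graal run of the same workload is the kind of accounting the slide is describing.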
Pretty good. This is a pretty interesting graph. What it shows is how many tweets per GC millisecond, and per CPU millisecond, we can serve; higher is better here. You can see that for GC milliseconds, C2 is basically always better, and that's because of the, what was it, 20 to 30 percent higher scavenge times; this is that graph again. On the other hand, when you look at CPU time, you can see that Graal, especially when the load is low, is better. In the top graph it's 20 percent less, and in the bottom graph, CPU, it's up to 8 percent better, probably somewhere here, but only when the load is low; it's basically the same over here. And if you remember the first graph I showed you, there was pretty high load, and then it went down again, and you can see the graphs follow that.

That's pretty much what we've seen so far. I tried to come up with a summary slide, and this is basically the only thing I could come up with: we could, in theory, replace C2 with Graal at Twitter today. There's an issue with the tail latencies; some people at Twitter care about that, and 4 percent might be too much. I played around with the things I mentioned: the young gen scavenges taking longer because you have more live objects in them, and the slightly higher occupancy of the old gen. I tried changing the ratio of the young and old generation sizes, to maybe get to a point where the scavenge times come back to where they were, with the old gen slightly bigger so we have the same number of old gen cycles. So far I haven't found the right ratio; that's why I didn't show it, because I couldn't get anywhere near the original scavenge times. I have to spend more time on that. That's it. I have a ton more slides, but no time to show them. Five minutes left? There you go; I can take questions now.
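The normalized metric behind that graph can be written down directly (the numbers below are made up for illustration): requests served per millisecond of GC time, and per millisecond of CPU time, so that the two configurations can be compared independently of the absolute load.

```java
// Normalized throughput: work done per millisecond of a given cost
// (GC time or CPU time). Higher is better: more useful work per unit
// of runtime overhead.
public class ThroughputPerCost {
    static double perMilli(long tweets, long costMillis) {
        return costMillis == 0 ? 0.0 : (double) tweets / (double) costMillis;
    }

    public static void main(String[] args) {
        // e.g. 1,000,000 tweets against 2,000 ms of GC time
        System.out.println(perMilli(1_000_000, 2_000)); // 500.0 tweets per GC-ms
    }
}
```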
So, besides the two bugs I've shown, which actually crashed, it hasn't crashed. But no, I have not run it on a thousand machines. In production, I've run it once, in real production, but only on one machine. And for the dedicated machines I have, it's basically one machine per configuration. But I've run them for, I don't know, a week or two, and I've not seen any issues so far. And we restart our services pretty frequently, so they're not running for a year, right? When we update the software, we have to restart them, of course.

I think I'll take a short question. What is the typical CPU utilization at Twitter, the CPU load, maybe? It's basically the one graph I showed you: about 20 percent, something like that. Well, the graph I showed you, the y-axis was in cores, that's what we call it. It really depends on the load you get, and we have multiple services running on one machine, for example. So, you know.

What is the performance of the compiler itself compared to C2? How fast does it compile a method? Oh, yeah, right, I haven't mentioned that. One thing I completely left out is the warm-up and startup problem, which for a lot of people is an issue, but for Twitter it's not. You have to compile your compiler, right? Because it's Java, and it's not AOT-compiled. In the tiered environment, the way it's set up today, Graal gets compiled with C1 only. You can tweak that and switch it so that Graal compiles itself, but the default is that C1 compiles it, so it's pretty fast to compile Graal itself. And it adds, I tried it with starting up an empty Eclipse workspace or something like that, about 20 percent of the startup time. But that's only in the first 20 to 30 seconds. And then the throughput.
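One way to put a number on that compile-your-compiler warm-up cost is the JVM's own compilation-time counter; a sketch, using the standard management API:

```java
// Read cumulative JIT compilation time from the JVM. During warm-up
// on a Graal-enabled JVM this counter also includes the time spent
// compiling Graal itself (by C1, in the default tiered setup).
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

public class JitTime {
    static long totalCompilationMillis() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        // The counter is optional; fall back to 0 where unsupported.
        if (jit == null || !jit.isCompilationTimeMonitoringSupported()) {
            return 0L;
        }
        return jit.getTotalCompilationTime();
    }

    public static void main(String[] args) {
        System.out.println("JIT time so far (ms): " + totalCompilationMillis());
    }
}
```

Sampling this at startup and again after 30 seconds would show the kind of 20 percent overhead mentioned above concentrated in the first half minute.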
If you compile Graal with C1 only, its code is not as good, obviously, because C1 doesn't do that many optimizations. But if you compile it with Graal itself, it's about the same.

To continue the topic: if you load a new class and your CHA result gets invalidated, or a final field is no longer final, and you get a deopt, did you measure whether the time it takes to recompile a given method is longer than with C2, or is it on par, an unnoticeable difference? It's about the same. Well, it depends on when it happens. If it happens later in the game, when Graal is already warmed up and everything's compiled, it's about the same. But if it happens earlier, because it's an OSR it might trigger paths in Graal itself that are not compiled yet, which means you have a slight delay. But it's not that bad.

Really cool work. Do you think there's anything in particular in Twitter's environment that makes Graal perform this close to C2? Because, at least from my point of view, they are really close. Are there any benefits to running in the Twitter infrastructure, compared to just letting this loose on all kinds of applications? I don't really know. Almost all of Twitter's software is Scala. I don't think anyone has put work into optimizing Graal for Scala, but it seems to just work fine. It definitely depends on the code shape, but the stuff we're doing, this service-based Finagle stuff, seems to work fine. There are definitely other cases where it's not that good. I might have shown this last year, or the year before: when you run SPECjbb with Graal, it's like 20 percent slower.

Hey, Chris. How does the additional heap required translate into overall reserved memory compared to C2? Because I imagine HotSpot has its own data, but it's off-heap. Yeah, right, so there's something there.
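Going back to the deoptimization question above, here is a hypothetical sketch of the class-hierarchy-analysis situation being asked about (the class names are mine): while only one subclass of a type is loaded, the JIT can devirtualize and inline the virtual call; instantiating a second subclass later invalidates that assumption, forcing a deopt and a recompile.

```java
// CHA-style devirtualization sketch. While First is the only Base
// subtype the compiled code has seen, work() can be devirtualized
// and inlined. Introducing Second at the same call site invalidates
// that speculation: the JVM deoptimizes and recompiles.
public class ChaDemo {
    static abstract class Base { abstract int work(); }
    static final class First  extends Base { int work() { return 1; } }
    static final class Second extends Base { int work() { return 2; } }

    static int callManyTimes(Base b, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += b.work();   // monomorphic at first, then polymorphic
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(callManyTimes(new First(), 10_000));   // may get devirtualized
        System.out.println(callManyTimes(new Second(), 10_000));  // new receiver type -> deopt
    }
}
```

The question in the talk is about the cost of that recompile; the answer is that with Graal it is about the same as with C2, unless the deopt happens so early that Graal's own code paths are still uncompiled.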
So I haven't done this measurement in quite some time, but I did it a few years ago: how much memory does C2 use when it compiles something, and how much does Graal use? C2 can spike up to 500 megabytes for compilations. These are the outliers, and it needs to be a huge method with lots of loop optimization going on, but it can spike a lot, and usually you don't see that. You do see it with Graal, because it's on your Java heap, so it's much more visible. Back then, when I measured this, if you ran with compressed oops, Graal was actually using less memory than C2; with regular references it was slightly more. But I'm really hoping for value types and all that stuff to help out here, because it's basically a compiler graph, that's the biggest thing, and if you have value types for that, it will be really cool. I'm just waiting for that. Okay, that's it. Thank you.