Hi everybody, I'm Christine. This is a gratuitous picture of my dog, just to set the tone for playing well together. So I will play touching in the leg. I didn't write any of the things I'm about to talk to you about, but I think they're kind of cool, and I came up with some motivating examples that I think are approachable, so you guys can see what's going on. So, let's talk about Java, right from the very beginning. You guys all know this trick. Your Java source code gets compiled by javac into bytecode, and your bytecode goes into the JVM, which is kind of like a black box to all the hardcore Linux folks. So your Java bytecode gets interpreted by the template interpreter, or it can be compiled by either the server or the client compiler. And from the outside looking in, the Linux tools don't know how to interpret that code. So there have been some tools recently developed to make common Linux tools work on code running inside the JVM. And that's what I'm here to talk to you about, and I'm going to demonstrate some examples. So there are two tools I'm going to talk about. The first is GDB. GDB can now work well with your Java methods, interspersed with all your C methods, when you're trying to debug your Java code, which is what you're going to want. And for me, as a JVM developer, my real job is writing garbage collection algorithms. As a JVM developer, that's what I want. I want to know which Java method it was that I screwed up, so I can figure out what I did wrong. And I'm also going to talk about the perf Java tool. In a previous life, I spent two weeks trying to track down why my program wasn't scaling. And it wasn't scaling because somewhere somebody decided to have a global variable. And I only found that by looking at the cache misses. So we're going to demonstrate the perf Java tool and how it can show you where your cache misses are and why your program isn't scaling. So I have to give you the motivating examples, right?
I didn't really have enough for a whole half hour without giving motivating examples. So prime numbers, you guys, that's universal, right? We all know what prime numbers are. They're only divisible by one and themselves. And this is, is this going to work? This is going to work. So the way people who want to calculate prime numbers do it, sometimes, is they go through something like the Sieve of Eratosthenes, where they cross out all the ones that are multiples of two and all the ones that are multiples of three and all the ones that are multiples of five. And then they keep the numbers that are still left as they go along. So you can see that two, three, five, seven, eleven are all prime numbers. So this is really easy to write in Java streams. In fact, this is the program to write out all of the prime numbers up to whatever number you want. So if I wanted the prime numbers up to one thousand, I'd just go from two to one thousand. And then I'd filter them: for each number, I go from two to the square root of that number and take the remainder of the first number divided by the second number. If any of those is zero, then we know it's not prime. And if I have time for a demo, I'll show you how this works. But it will work and it does do what it says it does. But it's very abstract, right? It's very high level. And I'm very down low, and I really want to know what's going on. And I put this sum in, because this is a reduction, and that way I can show you the demos and you don't have scrolling lines of hundreds of prime numbers. But in Java streams, if you put the parallel call in, all of a sudden your program magically executes on as many threads as you have. And it's just black magic. And so if I were a Java programmer and I was looking at this prime number program, and I wanted to see how this actually works, what's going on under the covers, I might want to run GDB and look at it.
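The slide itself is not part of this transcript, so the following is only a minimal sketch of the kind of stream program described above. The class name `PrimeSum` and the methods `isPrime` and `sumOfPrimes` are hypothetical names chosen for illustration; the stream calls (`IntStream.rangeClosed`, `noneMatch`, `filter`, `parallel`, `sum`) are standard Java 8 APIs.

```java
import java.util.stream.IntStream;

public class PrimeSum {
    // n >= 2 is prime if no d in [2, sqrt(n)] divides it with remainder zero.
    // For n = 2 and n = 3 the divisor range is empty, so noneMatch returns true.
    static boolean isPrime(int n) {
        return IntStream.rangeClosed(2, (int) Math.sqrt(n))
                        .noneMatch(d -> n % d == 0);
    }

    // Sums all primes in [2, limit]. The sum is the reduction that keeps the
    // output to a single number instead of a scrolling list of primes.
    static long sumOfPrimes(int limit, boolean parallel) {
        IntStream candidates = IntStream.rangeClosed(2, limit);
        if (parallel) {
            // One extra call moves the work onto the common fork/join pool.
            candidates = candidates.parallel();
        }
        return candidates.filter(PrimeSum::isPrime).asLongStream().sum();
    }

    public static void main(String[] args) {
        System.out.println(sumOfPrimes(1000, true));
    }
}
```

Passing `false` for `parallel` gives the sequential version of the same pipeline, which is the comparison made later in the talk.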
And in fact, I will show you this later. If you just run GDB straight out of the box, the way it is now, you're going to get something that looks like this. You're going to see a bunch of addresses. And from these addresses I don't even know if they're interpreted code or compiled code or whatever. All you get is machine addresses. And that's not helpful. That's not going to do you any good. So, the GDB unwinder to the rescue. This is some code that was written by my colleague in the Red Hat Java group, who actually knows how the symbols are stored inside the JVM. So when you run with his unwinder, and I will point you to the instructions for how to get that all installed, you can see that in this program, okay, I have some interpreted frames here. I stopped it pretty early on in GDB. And you can see that we actually have the reduction op, which was my sum, and then we have an interpreted frame on top, which is for ForkJoinTask. And that gives you an idea of what's going on under the covers. It's really work stealing. It's really that whole java.util.concurrent API. But here you have something that you can go and look at and you can say, okay, this is what my program is doing. This is where I am. If it went wrong, you can go and see the Java stack down to the C stack down to your calling function and figure out what's going on. All right. So I also did it compiled. And here you can see that some of your methods were inlined by the JVM. And so my first Java frame was really all these inlined methods, and another method was compiled. And it basically gives you an idea of what's happening inside the JVM if you want to have a better understanding of what your code is doing from GDB. Here we are again, showing you the ForkJoinTask with... I don't know why that's not there.
Anyway, oh yeah, that's right there because what's interesting is that in your code, the hot methods, the ones that are run a lot, are the ones that are compiled. So the deeper you go down the Java stack, the more likely you are to start running into interpreted frames, because these are the high-level, top-level methods that you call, which only get run once. So you can sort of see that there. And you have better insight into what's going on inside the JVM. And you can comment out the parallel and do it again. And you can see that the sequential looks a lot like what I showed you for the parallel, except that you get an evaluateSequential for your reduction op for the sum. If anybody has questions, please ask, because I'm going through this really quickly. Yes? Does this work for other JVM languages, or is it at the bytecode level, so you might not get useful readable information? The latter. And if you want it for like JRuby or something, yeah, no, it's... Okay, I can't say that for sure. It might, but what I understand is that it looks at the Java methods that are compiled, and so if you've got anonymous methods or lambdas or something, it kind of gets confused. And so there's the peripheral that's coming out. Well, I'm speaking about this. Will it work with all JVMs, like Azul, for example, or will it work only for OpenJDK? Only OpenJDK. I can't swear it won't work for Azul. I've never been able to run Azul. But I don't see any reason why it would, right? Because we all work on OpenJDK and we know how the internals of OpenJDK work. Okay, so where can you get it? There's the path nine where you can go through addresses. It's wonderful working for Red Hat, where everything's just out there in the open. Andrew Dinn wrote an email message where he described exactly how it works and how it goes diving through the GDB data structures. And it's kind of magic for me because it's really nice.
I spent a lot of time taking addresses, going back, trying to find the compiled method, and that's really stinky, and so this is a big help. In my real life, I put read barriers in methods, and if I have a missing read barrier, I have to know what method it is that has the missing read barrier, and so this is huge. Okay, so my motivating example for perf is a random number generator tester. I've also spent some time writing random numbers. And so if you generate a stream of random numbers and you filter them into bins based on their remainders, you should get an even distribution, right? If your numbers are really random, they should sort of get into all of the different bins. So if you have three bins, 91 has a remainder of one when you divide it by 3, 17 has a remainder of two. So basically you can set up as many bins as you want, run the remainder function, and you should get an even distribution. And I used all this to create a hairy arm problem. I wanted to create cache contention, and doing it on a single variable didn't work, but doing it on a bunch of bins worked. So I do this calculation in parallel with some number of bins, some number of threads, for some number of seconds, and I look at the distributions and see if they're good. So I just ran it with the Java Random, which was pretty good, right? Look at these numbers: I ran it with 4 bins and 10 threads, and you can see that I've got a pretty even distribution across my 4 bins. So what's the point? I'm putting all of that in this array, and I'm going to make it a bigger array, a 1024-element array, and I'm going to have a bunch of threads all banging on it. And if I wrote it correctly, your L1 cache lines should be pinging from thread to thread, and we should be able to see that in perf. And of course we can, because I'm going to demo it for you guys. So again, if you don't have the jitted symbols and you run it, you get something that looks like this.
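The tester's source is not shown in the talk, so here is only a rough sketch of the idea as described: several threads draw random numbers and bump a shared array of bins indexed by remainder. The class name `BinTester` and the method `fill` are hypothetical. Two things are assumptions made for illustration rather than details from the talk: the sketch runs for a fixed number of samples per thread instead of a number of seconds, so the result is repeatable, and it uses `AtomicLongArray` so the counts stay exact while every thread still writes to the same cache lines.

```java
import java.util.Random;
import java.util.concurrent.atomic.AtomicLongArray;

public class BinTester {
    // Each thread draws samplesPerThread random ints and increments the shared
    // bin chosen by the remainder. Sharing the array is what makes the cache
    // lines bounce between cores.
    static long[] fill(int bins, int threads, int samplesPerThread)
            throws InterruptedException {
        AtomicLongArray counts = new AtomicLongArray(bins);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final long seed = t; // one seeded java.util.Random per thread
            workers[t] = new Thread(() -> {
                Random rng = new Random(seed);
                for (int i = 0; i < samplesPerThread; i++) {
                    // floorMod keeps the bin index non-negative for negative ints.
                    int bin = Math.floorMod(rng.nextInt(), bins);
                    counts.incrementAndGet(bin);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        long[] result = new long[bins];
        for (int b = 0; b < bins; b++) {
            result[b] = counts.get(b);
        }
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        long[] counts = fill(4, 10, 1_000_000);
        for (int b = 0; b < counts.length; b++) {
            System.out.println("bin " + b + ": " + counts[b]);
        }
    }
}
```

With a reasonable generator the bins should come out close to equal. Raising the number of bins to 1024 and the thread count well past the core count matches the contended setup described above.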
And this is doubly heinous because you can see those addresses are pretty close together. I believe one of those is the interpreted method and one of them is the compiled method, but it doesn't know, right? The perf tool doesn't know that they're the same thing. So once I added the perf Java tool, you can see that it conglomerates those together and recognizes that they're both in my hairy arm loop. And you can see that when I ran this, 97.9% of my cache misses were writing to this array, which is what we expected. You can get Java perf. It should be available in RHEL 7.4. I'm not a project manager. This is what people tell me, but I'm just telling you about these interesting tools and where you can get them. I don't have any control over when it actually goes out. And that was actually written by Google, but again, it's very useful for a Java programmer to be able to see where in their program they're getting cache misses or TLB misses. You know, if you get a lot of TLB misses, you probably want to go to large pages, because if you use large pages you get fewer TLB misses. If you're interested in performance, it's very helpful to have a tool like this. And I added this on just because it made the demo look nice. There's something called perfbar that back in Solaris land was really cool, and Red Hat has nothing to do with this. Doug Lea put it out there and somebody else is maintaining it. But I wanted to bring it to your attention because it's so helpful for doing performance work. And this is an awful screenshot of my machine, but you can see that this was the point where I was running with, I think, six threads, and you can see that the other threads are idle.
If you're trying to write parallel code that uses the whole machine, having something like perfbar up and running gives you a visual display that yes, in fact, you are running on the whole machine and you don't have some stupid global variable keeping you from getting the performance that you want. Okay, so I'm going really fast. Let's get to the demos. That's perfbar, and I'm going to run, actually I'm going to be done in less than 20 minutes. Okay. Well, hopefully we can get questions. All right, so let's go back to GDB. So if you run GDB without the GDB unwinder, this is one where I don't have it installed, you can see that, no, it's even worse. You see all these question marks and you don't know what's going on in vtable for no GC verify. I don't even know what that is. But if you run the other version, which has the GDB unwinder in it, you can still see it doesn't show the symbols when you do all the threads. Oops. Okay, there you have them. So you can do that again. But you can see what I showed you before, where you have the inlined methods all lined up. And that's it. It's not hugely exciting. It's not bells and whistles, but it's extremely useful. And I will show you perf. So we're using the whole machine. I'm asking for 300 threads and you can see that my little 8-core box is going crazy trying to run this thing. Now would be a good time for questions, because I told this to run for 100 seconds, so it's going to be a minute or two before it comes back. Is anybody... Is this something that people can use? I have a question. Does the unwinder require you to build OpenJDK in some debug mode? Yes, you need the debug mode, but there are RPMs with the debug mode out there. I built the debug mode because I'm used to it. I have a slightly off-topic question. Have you heard any news about Shenandoah? Have I heard any news about Shenandoah? Where did I hear news? Have you heard any news about Shenandoah?
I'm going to give a talk on Shenandoah tomorrow. And I will tell you everything, but the top-level, high-level bit is that we ran a certain warehouse benchmark that I'm not supposed to name and we won in both critical ops and in throughput ops. So both throughput and responsiveness. So we're there. We're real. Oh, nice. And it's still running. Christine, maybe you said already: will these tools run with OpenJDK on Windows? I'm a deer in the headlights. I don't know. I'm sorry. I am... Perf is one, right? Perf depends on... Yeah, perf is a Linux tool and I... GDB... What you do is you run perf to get the perf output and then you have to inject the jitted symbols into it. Once you've injected the jitted symbols you can then do... And here you can see that I spent 98%. And just for grins, I ran one of these with one thread instead of 320. And you can see that in the single-threaded case, where you don't have that cache contention, all of your cache misses are in other places. So it's one way to see that your parallel code is fighting over the cache. Okay, I guess I talked even faster than I expected. This isn't my area of expertise. These were all written by other folks. She's signed her den in Dublin and the woman at Google. So if anybody has any more questions, otherwise I guess I can let us out a little earlier. Maybe just a small addition: this is really useful, and if you've never heard about Java flame graphs, that's another very useful tool, because it can combine the output from perf and Java together. So you can, for example, see in one graph if your problem is in I/O, for example a network card or something like this; you can see it all in one graph with this tool. It plays really nicely together with Java. So it's another very useful tool, I think. Flame graphs, yes, they're very useful. And if I had thought of them, I would have put them in here too, because I use them. Yes? What's the duration you're running for?
Because I don't want to make you dive deep into this if you don't want to, but a lot of this is in the JVM and 1% of these are in register allocation. Is that running long enough to even... It's running very short. I'm running for a thousand seconds. Sorry, 100 seconds. So it's a short one, just to be able to demonstrate what I wanted. I don't trust the rest of the stuff. Perf is not always great about figuring out what's really yours. You know, I've had Chrome show up in there sometimes where it's just... that's where the register will happen. But the top ones you can trust, and that's really what you want to look at: certainly anything above 10%, probably anything above 5%, is something you'll want to be looking at. So you guys now can get random numbers and prime numbers and perf and GDB and perfbar.