Good afternoon. My name is Kristof Beyls. I'm going to talk about using LNT, one of the LLVM sub-projects, to track performance. Last year, in this same dev room, I already talked about this topic a bit: how to automatically track the performance of LLVM-generated code. What did I say last year? Let's see. I talked about improvements that were needed in two LLVM projects. One is LNT; the other is the test suite, which contains lots and lots of open-source packages that get used to test LLVM. LNT is, I would say, test infrastructure software: lots and lots of scripts, a database, and analysis tools to look at test and benchmark results. So I presented a range of ideas last year. A few improvements had already been implemented, and I also presented a range of ideas that still had to be implemented. Since then, most of them have been implemented, and I think that by now it's actually much nicer to track performance on top of trunk with less human effort. So I hope to give you a demo and convince you that we have indeed made a big step forward.

Here is a copy-paste of my conclusion slide from last year. It's a little bit unreadable, so I'll read it out loud. There were a number of properties that I thought a continuous integration system really ought to have. One was that when it flags up an issue, it should flag it up in a way that is actionable: as a developer, you get the mail, or whatever kind of notification, and then you know what to do to go and fix the problem. Last year I thought this had improved, but the red colour and the arrow mean it was still very far from where we'd like it to be. Another property was to require as little human effort as possible; last year my feeling, my experience, was that there was a bit of improvement there too, but still not where we need it to be. The last bullet point is that a continuous integration system should really help to enable a culture where everyone working on, in this case, LLVM actually acts on the deltas: you committed something, because of that some correctness or performance metric regressed, and it should be natural for you to just go and act on it so the regression goes away. My personal assessment last year was that we hadn't implemented any improvements there; I hadn't seen any improvement on that just yet, and we were still far from where we'd like to be. Spoiler alert: I think for all three of these we've made some improvements, and I'll show you some details later on. I also ended by saying: consider using LNT's performance tracking infrastructure. I think since last year we've made it a little bit easier to actually go and do so; I'll give a few more details.

So, first, on making sure a continuous integration system actually signals things that are actionable and requires little human effort, let me do a demo of what we typically do. I'll sit down. For the demo, please shout out questions as I go along if something isn't clear. I've been doing this on and off, almost day by day, for quite a few months, so I might jump over things that are not obvious to everyone. If something is not obvious, please shout out.

What I'm showing here is called the LNT daily report page. This is an LNT web server running; lots of performance measurement data has been pumped into its database, and this report page summarizes the issues that seem to be important on that particular day.
Now, this date is from April last year. I picked it because it has a few examples that are easy to explain in a demo. The resolution on this monitor isn't great, so things look a little bit off; on normal resolutions it looks a little bit nicer. There's a bunch of machines that run tests, and you can see here, for the past seven days, whether or not each of them submitted test results. In LNT there are a number of metrics that can be tracked. In this case, nothing special was flagged up for the code size metric or the score metric, but for the execution time metric things did get flagged up.

Before jumping to the biggest regressions, maybe I'll just show this one first. This is a particular program in the LLVM test suite where LNT says: today, the performance of this program seems to have regressed by 6 percent compared to LLVM top-of-trunk yesterday. One of the things we added, which was also discussed last year, is these little spark lines. What you see there is how the performance evolved over the past seven days, so just from this overview page you get an immediate feel for the trend. We also do multiple runs per day. I hope everyone can see this; maybe I'll scroll up and try to zoom in a little bit.

One of the features, and I don't remember whether it was already implemented last year, is that the spark line also has a background colour. That background colour represents whether the code generated by LLVM for this program changed or didn't change on that particular day. It's computed from a hash of the generated code, which means you can interpret this as follows: this is a data point from today, that's a data point from yesterday and the day before, and they all have the same background colour, which means the actual code produced by LLVM didn't change on those days, even though the system reports different performance numbers. So you can immediately ignore this one: you know it's my system that's a bit noisy, the code didn't actually change, and there isn't actually any regression here. That already saves a lot of developer time, because with just a glance we can decide: ah, we can ignore this one, it's just noise in the system.

Audience: And there are the top points as well, which are identical to the current points. It's multi-modal.

Yeah, it's a bit of a multi-modal program. I gave presentations at the US and Euro LLVM developer meetings last year that go much deeper into what the causes of that may be and how to handle it.
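As a rough illustration of that "hash the generated code to filter out noise" idea — this is not the actual LNT implementation, and the sample data below is made up — a minimal sketch in Python:

```python
# Sketch of the idea behind the spark-line background colour: hash the code
# the compiler produced, and only treat a performance delta as real when the
# hash (i.e. the generated code) actually changed.
import hashlib

def code_hash(generated_code: bytes) -> str:
    """Identical hashes mean the compiler produced identical code."""
    return hashlib.sha256(generated_code).hexdigest()

# Hypothetical daily results: (execution time in seconds, bytes of the binary).
daily_runs = [
    (4.10, b"\x7fELF...day1"),
    (4.35, b"\x7fELF...day1"),  # 6% slower, but the binary is identical...
    (4.78, b"\x7fELF...day3"),  # ...whereas here the generated code changed.
]

prev = None
for seconds, binary in daily_runs:
    h = code_hash(binary)
    if prev is not None and h == prev:
        print(f"{seconds:.2f}s  code unchanged -> delta is measurement noise, ignore")
    else:
        print(f"{seconds:.2f}s  code changed (or first run) -> worth investigating")
    prev = h
```

The point is simply that when two runs hash to the same value, any measured delta between them has to be noise in the measurement rather than a change in the compiler.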
If we now look at a real regression: the top regression here says that on three different cores, which happen to be Cortex-A53, Cortex-A57 and Cortex-A9, the performance regressed — on all three of these cores. Here it regressed by 188%, which is clearly not good, and here by 75%. We can also see that the background colours changed compared to yesterday, so there was an actual code change there. Let's go and investigate. Let's click on the link for the biggest regression, on A53.1. Where we jump to now are the long-term plots of how the performance of this particular program on that particular machine evolved over time. If I had clicked this on the 15th of February, the chart would have stopped here; of course we're now quite a few months later, so there are a lot more data points. But you can already see that this is a major outlier. Let's investigate it.

We are now jumping to that particular run on that particular machine. A whole bunch of programs from the test suite were run as part of one run on that machine, and we have an overview page here. Once again this program pops up: a 188% performance difference. We really need to investigate this. With some of the improvements made since last year, you can now also collect Linux perf profile information as part of the test-suite runs, and that gets pumped into the database too. If you go here, you see an overview of it. At the top we see that the number of cycles spent has indeed increased a lot. You also get a summary of, in this case, branch misses and cache misses — the headline micro-architectural events that might cause a performance difference. And then you have the different functions. The hottest function is probably the one where the code generation changed. Let's look at the hottest function: this is the old version, and this is the new version. What you see here, side by side, is the output you would get if you ran this under Linux perf. This also helps developer efficiency: before, if you just knew there was a 200% performance difference, you would have to go and re-run the program under Linux perf to get this data. If it gets stored in the database automatically, that just...

Audience: ...saves you finding a similar machine with a similar system.

Yeah, exactly. This just saves a lot of time. Before, it typically took a few hours — you end up spending a few hours, unless this is the only job you do, and nobody wants this to be the only job they're doing. If we just scroll through, we see low percentages, and here we see some hot code: this is the hot code in the old version, apparently, and here is some hot code in the new version. Sometimes in this view you can already see the interesting instructions, but it can take a bit of time to investigate. So one extra improvement we've made on top of that is that, in the web interface, we've added a bit of JavaScript to reverse-engineer the control flow graph. What you see now is the control flow graph structure of this whole program. Every grey block is a basic block, and you see the arrows jumping between them. You see that all the hot code in the old version is in a single basic block, and all the hot code in the new version is also in a single basic block, so we have to compare those two. There's a percentage here: 98% of the time is spent in that single basic block; in the new version it's 99%. Actually, these are relative numbers; let's look at absolute numbers instead. Now it's much clearer that there is a performance difference: before, we spent about 430 million cycles in that basic block, and now it's far more. So you know you have to look at these few instructions here; that's where it happened. The resolution of this screen isn't high enough, but I hope you'll believe me that here you see a division instruction, and in the old code there's no division instruction. Most developers know that division instructions can be costly; that's probably what caused the difference. It's a UDIV, so an integer division instruction, that before probably got... yeah, maybe it was a division by a constant, who knows. I think that might be the end of my demo, if I remember correctly.
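To give a feel for how simple that control-flow-graph reconstruction is: the real feature is a bit of JavaScript in the LNT web UI driven by regular expressions; below is a hedged Python re-creation of the same idea, with made-up perf-style AArch64 output. A real CFG builder would also split blocks at branch targets; this sketch only splits after branch-like instructions.

```python
# Regex-based "good enough" basic-block reconstruction from perf-style
# annotated disassembly. Sample input and mnemonics are illustrative only.
import re

# Each line: "percentage  address:  instruction", as perf annotate prints it.
SAMPLE = """\
 0.10   4005c0:  mov   w2, #0
 0.20   4005c4:  cmp   w2, w1
 0.30   4005c8:  b.ge  4005e0
49.00   4005cc:  udiv  w3, w0, w2
48.90   4005d0:  add   w2, w2, #1
 1.00   4005d4:  b     4005c4
 0.50   4005e0:  ret
"""

LINE_RE = re.compile(r"\s*([\d.]+)\s+([0-9a-f]+):\s+(\S+)(.*)")
BRANCH_RE = re.compile(r"^b(\.\w+)?$|^cbn?z$|^ret$")   # "this looks like a branch"

blocks, current = [], []
for line in SAMPLE.splitlines():
    m = LINE_RE.match(line)
    if not m:
        continue
    pct, addr, mnemonic, _operands = m.groups()
    current.append((float(pct), addr, mnemonic))
    if BRANCH_RE.match(mnemonic):        # a branch-like instruction ends the block
        blocks.append(current)
        current = []
if current:
    blocks.append(current)

for i, block in enumerate(blocks):
    total = sum(pct for pct, _, _ in block)
    print(f"basic block {i} ({block[0][1]}..{block[-1][1]}): {total:.1f}% of samples")
```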
So I think in the course of maybe five to ten minutes we've gone from "let's see what yesterday's run looked like", to a big performance delta, ignoring some pieces of noise along the way, to: this is the code that changed and caused the performance difference.

Audience: And you've done something that perf still doesn't have on AArch64, which is to understand the jumps and draw the basic blocks.

Yeah, as far as I know. Someone might be working on that, I don't know; the last time I asked was a few months ago and there was no support. For what it's worth, it's a really, really simple reconstruction of the control flow graph. It just has some regular expressions: this looks like a branch, so this is probably the target PC. It's probably 99% accurate, but for just looking at this, it's good enough.

So by going in there, we understand what the code difference is from one day to the next. However, LLVM typically gets about 100 commits per day — I can stand up again — typically about 100 commits per day, so you don't yet have the specific commit. If you look at the commit revisions, you know the range in which the code change happened, and sometimes you can just guess which commit it was. Quite often, another technique we use is llvmlab bisect, a tool that's in Zorg now. What that tool can do is this: if one of your build machines continuously builds top-of-trunk Clang, you have a whole bunch of Clang binaries available, and if you store them in a build cache on a separate server, that script can fetch the different revisions from that server. Then, if you add a script to it saying "this is problematic generated code, that is good generated code", it can bisect to a specific commit. You can follow the links for more documentation; the documentation is actually really nice. So in combination with LNT, as I showed, we understand what the code change is with just a few minutes of looking around, and, depending on the size of your benchmark, this bisect runs quicker or slower, but it can point you to a specific commit. So quite quickly you get to: this commit caused that code change, and it caused that performance delta.

Audience: Can you also keep your self-hosted, stage-2 builds?

We keep one version, not multiple; we keep just the Clang binaries produced by one bot, and it's just stage 1.

Audience: Because most of the problems we find are in stage 1, and those are easy, but the stage-2 ones we have to investigate, and then llvmlab bisect doesn't help at all, because we still have to do the stage-2 build anyway.

Yeah. In llvmlab, I believe you store the machine name that produced the Clang binaries, so you could just store all of them and say: these are stage-2 binaries. If you have a fast machine that can do stage-2 builds really fast, you could get a really nice resolution there.

Oh yeah, sorry, question. For the specific problem we were looking at here: we ended up sharing that information within 24 hours of it being committed. It was detected, and we found out what it was. How do we share that information? Well, just send an email to the llvm-commits list. Every commit gets an email; just reply to that one, and it gets to the original author, saying: we found something here.
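To sketch the general idea behind bisecting over a cache of prebuilt compilers — this is not llvmlab bisect itself; fetch_build() and is_good() below are hypothetical stand-ins for "download the cached Clang for revision N" and "rebuild the benchmark and check whether the generated code or performance is still good":

```python
# Hedged sketch of bisection over cached compiler builds (the loop that a
# tool like llvmlab bisect automates for you). Paths and the check are made up.
import subprocess

def fetch_build(revision: int) -> str:
    """Hypothetical: return the path to a cached clang built at `revision`."""
    return f"/builds/clang-r{revision}/bin/clang"

def is_good(revision: int) -> bool:
    """Hypothetical check: compile the test case and decide good/bad, e.g. by
    timing the benchmark or inspecting the generated assembly."""
    clang = fetch_build(revision)
    result = subprocess.run([clang, "-O3", "-S", "test.c", "-o", "test.s"])
    return result.returncode == 0  # replace with a real performance/code check

def bisect(good: int, bad: int) -> int:
    """Binary search between a known-good and a known-bad revision; returns
    the first bad revision."""
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_good(mid):
            good = mid
        else:
            bad = mid
    return bad

# print(bisect(300000, 300100))  # -> first revision where the code regressed
```

The wall-clock cost of such a bisection is dominated by how long a single "is this build good?" check takes, which is why the size of the benchmark matters.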
Luckily, we also saw the problem on one of the public bots, so then it's extra nice: the original developer, if they work somewhere else, can also see that particular problem. And yeah, it got fixed within 48 hours — regressions are cheap if they get fixed quickly. A really nice outcome.

So let me go back to my claims. With the demo I hope I've shown that the improvements in LNT do help to signal issues in a way that is actionable — we ended up sending an email quickly — and require little effort: the whole analysis we basically just did together took a few minutes.

Now, moving on to enabling a culture of acting on deltas. I think we're starting to see some signs of improvement there. I think it was Arno who actually sent the email last April, that particular one, and you can see it got acted on. So there's an improvement compared to a number of years ago, but there's probably a lot more that could be done to make it easier to act on deltas. My list of ideas: it would be nice if more of these performance-tracking bots were public, so more people can look at the performance results. At the moment there is one public LNT server. What I showed here was a top-of-trunk LNT server running on my laptop; the public one runs a version that's quite old, so we need to find a way to make sure someone can maintain that public LNT server instance. Once that works, the next thing people notice is that the test suite contains about 500 programs, and they don't cover all possible use cases of LLVM, so more code needs to be added. Actually, I do see some improvements there. Bitcode files got added from the Halide front end — there must be someone in the room who knows much more about Halide than I do, but it's a front end producing IR directly, not going through Clang, so that's nice: it might exercise some different idioms. We also see the start of some benchmarks representing HPC-specific use cases a bit more. But in the end, it's also important that when we add more tests, the whole test suite doesn't take much longer to run, since then we lose resolution. It's really nice now that the test suite, all in all, doesn't run that long; even on very slow systems you get feedback relatively quickly, so you can get multiple data points per day if you want to.

And the holy grail: for correctness issues, we now have bots automatically emailing committers if they introduce a correctness regression. The holy grail is to get to automated emails on performance deltas as well, but for that you need a really high signal-to-noise ratio — there have to be very, very few false positives, otherwise developers will just ignore the bots.

Audience: You'd have to run the bisection to understand which commit caused it.

Yeah, but then you saw how much human effort was involved here. I think for some kinds of performance issues, like the one I demonstrated, we're getting close; we could probably automate that one. But there are always going to be lots of deltas where, even between different compiler developers, there will be a difference of interpretation: "no, this is a regression" — "no, this is actually how it should be done", and so on.

Audience: About the test suite taking longer: I think that's somewhat self-defeating, because we do want more tests.

Yeah, the problem is the resolution. One way out of this: the way we do it today, the test set is "the benchmarks". We could instead say the test set is benchmarks one, two and three, and then split it into three sets that each take about the same time to run, et cetera.
Audience: We could parallelize the test suite.

Yeah, that's also possible. And, as part of another presentation I gave earlier, I still believe we could make the test suite run ten times faster and still find all of the problems we find right now, without any more noise. There are a few programs in the test suite that take quite a long time; if you just make those run, say, ten iterations instead of a thousand, we'll still find the same problems — we'd just have them at a higher resolution. It's a bit of work.

On using this for non-LLVM projects: I think nothing I showed here was specific to LLVM. It's for a code generator, and there are many code generators, so how easy is it to use LNT with other code generators? Well, the interface into the LNT server, as I demonstrated, is a JSON file: all the information goes into a JSON file, and that format has been documented since last year. So if you have your own tests that you've invested lots of effort in, as long as you can produce your results in that JSON format, you can import them and get all of the analyses you saw here. I can say that at ARM, besides the LLVM team, the GCC team has started using this, and we have a team working on a product called Cycle Models, as part of which they're developing a Verilog-to-C++ compiler; they have also started using it. That's just to show that if you have a code generator, my feeling is you could use LNT, if, based on what you saw in the demo, you think it's useful.

What else is there? The profile view I've shown: you run your program multiple times to make sure you get statistically significant results, and one of those runs has to be done under Linux perf to collect the profile that gets stored. What we've done in the test suite, with the CMake/Lit-ification of the test suite — meaning the programs are compiled with CMake and then lit is used to run them — is make this straightforward. If you add extra benchmarks, you just drop in a subdirectory with a CMake file, and out of the box it will run them multiple times, invoke Linux perf for you, and collect all the data in the right way. So if you add more benchmarks to the test suite, maybe for your own non-public needs, you can do that without lots and lots of boilerplate, as long as you have the CMake file that describes how to build your benchmarks.

Audience: And that's already being used to add SPEC and EEMBC and so on.

Oh yeah — we've added quite a few benchmarks beyond the public ones to these kinds of runs.

So, in summary, I think that without all that much work, using LNT is a big improvement compared to last year. Documentation has improved; there's probably still quite a bit to be done, because documentation gets written by the people who use the tool most actively, and then you get blind spots for things that might not be obvious. So if you'd like to try LNT and you find some of the documentation isn't perfect, please do raise it, either via email or in the LLVM bug tracker under the LNT component. And if you do use it and run into issues, please do file tickets there as well. And that's all — any questions?
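As a rough illustration of that JSON interface for non-LLVM code generators: the sketch below follows my reading of the LNT documentation, so treat the field names, the machine and benchmark names, and the submit URL as assumptions to check against the docs rather than as the authoritative format.

```python
# Hedged sketch: write benchmark results in a JSON shape that an LNT server
# can import. All names and values here are illustrative.
import json
import datetime

now = datetime.datetime.utcnow().isoformat()

report = {
    "format_version": "2",
    "machine": {"name": "my-board", "hardware": "cortex-a53"},
    "run": {
        "start_time": now,
        "end_time": now,
        # Whatever identifies the compiler version you measured:
        "llvm_project_revision": "300123",
    },
    "tests": [
        # One entry per benchmark; multiple samples per metric are allowed.
        {"name": "nts.my_benchmark", "execution_time": [4.10, 4.12, 4.11]},
        {"name": "nts.other_benchmark", "execution_time": [1.52, 1.49]},
    ],
}

with open("report.json", "w") as f:
    json.dump(report, f, indent=2)

# Then submit it to a running LNT instance, e.g. (URL is illustrative):
#   lnt submit http://localhost:8000/db_default/v4/nts/submitRun report.json
```

The idea is that submitting several samples per metric gives the server something to base its noise analysis on, which is exactly what the daily report and spark lines rely on.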
Great — oh, go ahead.

Audience: How does it scale?

How can I answer that? We use it internally at ARM and put all our performance data into a single database. There are actually two underlying databases that can be used, PostgreSQL or SQLite; so far we store everything in SQLite and we don't have a huge issue with it. Maybe for the first person accessing the web interface in the morning some data needs to be cached in, which takes a little while, but once it's cached it works. So, all in all, I don't think it has huge scaling problems. Some of the analyses that get run, for example for the daily report page, are quite involved and need to pull in quite a lot of information, yet you get the daily report page back in a matter of seconds. So we haven't seen too many problems there. There is one issue in the database schema, which is that it's a little bit inflexible to add extra metrics. Say, for example, that next to the code size or the performance of the generated code you want to measure another aspect of it: storing that in the current database schema is a bit inflexible, but it's also hard to change the whole schema, because you have to make sure the scaling stays at least as good as it is right now.

In the test suite you've got something like 500 tests per run, and then a whole bunch of machines producing results, so you probably need to ask how it scales either with the number of tests stored in the database, or, for the daily report page, with the number of tests run on that particular day, because that's the amount of data that gets analyzed. I'm not sure — I think we must be somewhere around 10,000 to 100,000 test results per day, just off the top of my head, as an order of magnitude. I'm assuming there must be people who need to go at least an order of magnitude higher, so try it out and give feedback.

Audience: But the report page only analyzes the last N days, it doesn't analyze the whole history, so it doesn't need to scale with the total number of tests. The only thing that reads everything is the graph page, which shows all the data points, but those are data points for a single program. So you never have to analyze all the data from all time.

No — well, if someone wanted to create another view that analyzes all of the data for all time, then yeah, that would be heavier.

Audience: And you can scale the number of days, right? I think the production one today is three days max, and what you're showing is seven days.

Oh yeah, that's just — let me go back — if you look here in the URL, there's a parameter for it. Very small tweaks like this, such as making it an input box on the web page, would be really nice; it's not there yet. There are probably lots of tweaks that could be done, but I'm a compiler developer, not an infrastructure developer: I added stuff until my job got a lot easier, and that's it.

Audience: Can you use it with a JIT?

Definitely for the page you see on the screen now — you can use that with any code generator. Where it gets more interesting is probably the profile page where you see the assembly. Right now LNT assumes that the profile data in the JSON file you produce has a format that looks like Linux perf output: a percentage on each line and then some code next to it. It doesn't have to be an assembly instruction. But you're going to have to... how do you get that out of the JIT?
How you get that data out of the JIT, I don't know, but if you have data that you can represent as a percentage of execution time with a line of code next to it, it can go in. Okay, it seems like that's it. Thank you very much. Thank you.