Good morning. Welcome to FOSDEM. I hope you guys had a fun time last night. It's good to see so many people have recovered from the festivities. Hopefully I'll have a seat to sit down in when I'm done with my talk; it's very crowded. I was told before I came that this was the worst possible time slot to give a talk at FOSDEM, first thing Saturday morning, right? But I'm happy to be here. I'm glad that Luke has organized the Graphics Dev Room again this year and that he made time for me to give a short presentation on my work. So thanks a lot.

I've been working on Linux platforms for more than a decade. Several of those years I spent building graphics performance tools based on a Windows tool that was used throughout the industry, and in that position I was able to see how important performance analysis tools are for graphics workloads. My project over the past few years has been to try to enable the same workflows for Linux platforms. I've also spent a lot of time automating the integration system for Mesa at Intel, which has helped Mesa's productivity and quality quite a bit. But this is really the project that I've been most interested in since I started with the Mesa team.

So, a little bit about GPU tools, and why you don't really have very many good solutions in the Linux space. In general, when you have GPU tools, there's a graphics card vendor that understands it's very difficult to go and find out performance bottlenecks or what's happening on the GPU, and they've gone and funded some tools, specific to their own hardware, to help developers and their own driver team figure out what the performance profile is of specific applications. But they are very reluctant to go and enable the same capabilities for their competitors. So if you do find a good GPU analysis tool, you'll often find it only works with an AMD GPU or an Nvidia GPU.
Some of the exceptions in the Linux space are made by Microsoft or other entities that care more about cross-vendor functionality. Most of the tools are written for Windows, with Linux as an afterthought. They're either closed source, or the extent to which they're open source is just two commits where they've dumped a huge pile of code into a GitHub account; and whether it compiles or not, you know, you may find that it does not.

This is changing a little bit. Intel has some engineers that are working on performance tools, like myself, and Lionel Landwerlin and Robert Bragg have worked on GPU Top, so there is more native support for performance tools. RenderDoc is another example, where Valve has gone and funded a developer to really invest in native Linux graphics analysis tools.

One thing about a lot of these tools is that tracing and retracing is often not reliable. That can be because the tool was initially written for Windows DX11 or DX10 games, and then when they go to implement tracing for OpenGL, they find the complexity of the extensions makes it hard to really capture the workload that you want to investigate. Another reason why tracing is often unreliable is that there aren't that many users. You might have a tools team that goes and tries to build a tool, but unless you have lots of developers going and applying it and looking at different workloads, you're not going to discover the bugs in your tracing system.

And up until recently, a big barrier has been the support for GPU performance counters in Mesa. Since Linux 4.13, that's enabled now for Intel GPUs, and AMD_performance_monitor is available for some of their newer hardware as well. So now that Mesa is exposing these extensions, there's a whole lot more that we can do.

So, my tool is called Frame Retrace. It's built on top of API Trace. I chose API Trace because I think it's the most widely used GPU analysis tool.
There's a lot of people that use it for quality assurance, to make sure that frames retrace properly, and because it has a large number of users, a lot of the corner cases of tracing have been found and fixed. It's a community-supported project, so there's lots of people working on it.

Right now, Frame Retrace is just a directory and a branch of API Trace; it's just a UI that is built on top of it. Because API Trace is cross-platform, Frame Retrace is also cross-platform, so it will investigate OpenGL workloads on Windows just as well as it will on Linux. That's an important capability for driver teams, because if you have two different driver implementations for different platforms, you can compare the performance profile for the workloads and find gaps in your implementation or in the Windows implementation.

Our counter support begins with Haswell. There were hardware counters prior to Haswell, but the architecture was different enough that the driver team decided not to enable them. So, your performance will be better with a newer computer anyways, right?

The Mesa driver team has been using this tool heavily to go and find issues in their driver, and there's a whole set of examples of different special cases that they've missed, which we found basically by looking closely at each render in a frame and understanding what the bottleneck is.

Right now, I'm trying to add support for Radeon hardware and Raspberry Pi through the AMD_performance_monitor extension. There are some other folks looking at that with me, and it's going pretty well; there are a few stumbling blocks for the Radeon implementation of that extension. I think that cross-platform support in this tool is one of the main things that needs to be finished before it's a good candidate for being upstreamed into API Trace. I think you'll see that the tool is pretty compelling and useful, and superior to the API Trace UI in some ways. So, I'd like to see it go upstream.
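A note on the AMD_performance_monitor path mentioned above: as I'll get to later, unlike the Intel metrics it exposes raw hardware counters rather than finished metrics, so a tool has to compose the counters into something readable itself. A minimal sketch of that composition step, with counter names and formulas that are entirely made up for illustration:

```python
# Raw counter values as an AMD_performance_monitor-style query might
# return them. The counter names here are hypothetical.
raw_counters = {
    "L2_HITS": 91000,
    "L2_MISSES": 9000,
    "GPU_BUSY_CYCLES": 500000,
    "TOTAL_CYCLES": 800000,
}

# Derived, human-readable metrics are just expressions over the raw
# counters; this is the layer the tool has to supply.
metrics = {
    "l2_hit_rate": raw_counters["L2_HITS"]
        / (raw_counters["L2_HITS"] + raw_counters["L2_MISSES"]),
    "gpu_busy_pct": 100.0 * raw_counters["GPU_BUSY_CYCLES"]
        / raw_counters["TOTAL_CYCLES"],
}

print(metrics)
```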
So, what does this tool do? Most graphical applications have a render loop, and the render loop just renders the frame over and over again. If you are looking just at the renders in those frames, you can divide up the frame into each specific draw call. This tool will give you the metrics associated with each draw call, and you can see exactly which render is the one that's taking all the time in your frame. Without it, generally you just have a huge asynchronous workload going off to the GPU and you have no idea why you're missing v-sync.

You can explore the frame by selecting specific renders, and it'll show you the render targets throughout the frame, which is helpful for understanding how a frame is composed. It has an API log, which is pretty standard.

For driver developers, it's pretty helpful to have batch disassembly. The batch commands which are sent directly to the hardware are disassembled and associated with the render that you've selected. Up till now, at least on Intel hardware, you would have to dump hundreds of gigabytes of data for any kind of meaningful frame and then try to sift through that data to find out exactly which render went wrong. This gives you a much more performant implementation and lets you see exactly what's going to the hardware for each draw.

One of the main features that end users and game developers need is a shader debugger, or some way to experiment with their shaders and find out why their shaders are misrendering. With Frame Retrace you can go to a specific render, look at the shader, change it, edit it, compile it, and it'll render again and give you a new performance profile for that shader, or an error if you've made a mistake. You can do the same thing with uniform constants: just go and see what the constants are and change them, and the frame will render again.
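The per-draw attribution described above boils down to something like the following sketch, with hypothetical clock counts standing in for the metrics the driver actually reports:

```python
# Hypothetical per-draw GPU clock counts for one frame, one entry per
# draw call, as a tool like Frame Retrace collects them.
draw_clocks = [1200, 950, 310000, 4400, 780, 15600]

total = sum(draw_clocks)

# Find the most expensive draw and its share of the whole frame.
worst = max(range(len(draw_clocks)), key=lambda i: draw_clocks[i])
share = draw_clocks[worst] / total

print(f"draw {worst}: {draw_clocks[worst]} clocks, {share:.0%} of the frame")
```

With real data this is what lets you say "this one render is more than 10% of the frame" instead of staring at one opaque asynchronous workload.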
There's a couple of experiments that you can do to help you figure out what the max performance would be for a specific render. And the thing that I've just now been adding is a hierarchical representation of all the GL state, so that you can change, like, the cull face and see what happens. So if you have a problem with your GL state that's affecting rendering or performance, you can muck with that. Those are the things we'll go through in the demo.

So, I'm taking a risk: let's have a demo and see what happens. This is the UI for Frame Retrace. This blue bar is actually a graph of renders with no metrics, but you'll see here there's a long list of GPU metrics associated with the L3 cache, the pixel shaders, the vertex fetch hardware. A lot of these are somewhat inscrutable if you're not familiar with the hardware or don't do a lot of GL programming. The one that you really want to look at, if you want to see why something is slow, is how many clocks were required to render the frame. And so this is a graph where each bar is a specific render. There are quite a lot of them, but by far the most expensive one is here, and there's a table that will show you the metrics. So here are the clocks, and you can see that more than 10% of the entire frame is just for this one render.

If you're curious about what a GTI L3 Bank L2 read is, there's a longer description for that metric that will help you decipher what it means. Typically you can go through here and find an explanation for why this might be the bottleneck for your workload. If you want to see the render target at this part in the frame, you'll see that our heroine has found the object of her desire. And for the rendering of this frame, if you want to see what's actually being rendered: it's rendering the whole screen, and in the API calls it's just drawing a couple of triangles for the rect.
So it's a little bit puzzling why this might be long, but there's also this glMemoryBarrier, which is probably something that we'd be interested in looking at. If you want to search for glMemoryBarrier, you can look at the different renders which contain glMemoryBarrier calls.

So, for the experiments: if you wanted to see how fast this would render if I just had a simple shader that just drew pink, you can select that, and you can see that the cost is much lower. We go to the shaders, and in the fragment shader it's got just a substituted fragment shader that just draws pink. So let's disable that and go back to the shaders. Now we're in the fragment shader again, and you can see that there's quite a long fragment shader. So it looks like it's processing all the pixels with some effect, I guess. The vertex shader, if you look at it, is a whole lot of nothing until you get to the very bottom, and it just does nothing.

We capture the intermediate representation and the static single assignment form that's output by the Mesa driver. NIR is our new intermediate representation, and this assembly is what's actually sent down to the hardware. The same thing for the fragment shader: you can see exactly how the shaders are compiled. So this is very helpful for a driver engineer, or I guess if you're an elite OpenGL programmer maybe you could make sense of this.

So we spoke about the batch. This is an example of the batch. If you look at a handful of renders, you can select one and you can see this is the binary packet that's sent down for the rendering. Again, more for driver developers.

All right, so let's go back to experiments. If we look more closely at these renders, let's look at the render target. You can see that if we stop at a render, that means it's going to show the render target immediately after this render. As you advance through these renders, you can see that it becomes progressively blurrier.
So there's a little blur, and it's going to get even blurrier on this render. And then finally it's going to compose those blurry images based on the depth of each pixel. So in the background there's a light here that's quite blurry, and if you look at the first render, it's in sharp focus. So it's a depth-of-field effect that they're achieving with these final renders. That's just one example of how you can experiment.

So this is an expensive render, but it may just be expensive because there's quite a lot of pixels. If you want to look for expensive per-pixel metrics, you can graph on the second axis. I'm just going to narrow the list of metrics that are displayed. Now the width of each bar represents roughly how many pixels are drawn, and so you might look for narrow, tall bars representing very expensive renders. Let's disable this one to make it larger. So you might focus in on this tiny shader here, which, I guess because of the way it's drawing this particular texture, is not very many pixels at all but is quite expensive per pixel.

All right, so what I want to do now is explore a little bit. Let's go to the search bar, and I'm going to look for vertices. I want to go and look on the render target for where our heroine is rendered. You can see the different render targets that are drawn in this pass, and if we highlight, we'll see that those are the renders that are drawing her body. And so we'll start here, I think. So this is the full rendering of the character. If you clear before the render and stop after it, all you'll get in the render target is the character itself. The reason I wanted to do this is to show how you can go to the uniforms. These are all the uniforms that are bound for the render. You can just change one of them, hit return, and go back to the render target.
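The per-pixel view from the demo comes down to ranking draws by cost per pixel rather than total cost, which is what makes the narrow, tall bars stand out. A small sketch with made-up numbers:

```python
# Hypothetical (render, clocks, pixels) samples: a draw can be cheap
# overall but very expensive per pixel, which is what the narrow, tall
# bars in the second-axis graph highlight.
renders = [
    ("fullscreen blur", 310000, 2_073_600),  # whole screen
    ("tiny glow quad",   45000,     4_096),  # few pixels, heavy shader
    ("ui overlay",        9000,   250_000),
]

# Rank draws by cost per pixel rather than by total cost.
by_per_pixel = sorted(renders, key=lambda r: r[1] / r[2], reverse=True)

for name, clocks, pixels in by_per_pixel:
    print(f"{name}: {clocks / pixels:.2f} clocks/pixel")
```

Here the fullscreen blur dominates total clocks, but the tiny quad is by far the most expensive per pixel.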
So for people who aren't really familiar with OpenGL, this kind of interactive editing is a really interesting way to dissect a more complicated frame and understand some of the techniques, or how the API is used. So let's put her head back on.

Oh, I mentioned shaders. So let's go to the vertex shader. Somewhere at the bottom it's going to assign a color, so let's just go ahead and modify that. I mean, if I compile this I should get a syntax error saying I've made a mistake, but let's do 0.0. So I'm just going to make the red channel 0 with that multiplication. We'll go look at the render target, and now we have a hulkified heroine. So that really demonstrates that you can mess around with the shader and try to figure out why it's misrendering. And you can see how quick this is. The fact that you can do this in a fraction of a second is far better than what you had before with other tools.

So, what I've been working on recently is this hierarchical state tree. You can collapse different items that you don't want to look at, and if you don't know how I've organized them, you can search for substrings, like maybe I'm looking for the scissor state. If you want to go and change something, the menu shows you the full set of available options in the GL for this particular blend feature. And a lot of the different GL state settings have a set of four values, so it'll give you the index of each one; it might be red, green, blue, alpha, or it might be some kind of enabled flag. So if I go and disable green (our heroine was green before), and we look at the render target, we'll see that she's kind of fading away. It's fun to play with.

But here's another one: culling is enabled for this character. That means that the triangles on the back of the character are not rendered, because they're facing the wrong direction. If I change it to cull the front of the triangles instead of the back, I go look at the render target.
And in the render target, I've turned the character around; she's decided it's too dangerous to go after the diamond, and she's going to avoid disaster and walk right back out. So that's just an example of how you can mess around with these things.

One thing that's interesting: if we go back and look at the final render, I've gone and changed the state, but the character hasn't turned around for the final render. The reason for that is that I had actually disabled that draw with the memory barrier in my earlier experiment. So if I turn that back on, so the frame is rendering properly, and look at the render target, I'll see that the final frame is rendered with the changes.

So that's my demo of the features. I think there's a lot more that can be done in each tab. There's a whole lot of GL state that I haven't gone and implemented, but what I've tried to demonstrate is that each category of state is supported in a way that's relatively easy to expand. And there's a bunch of experiments that need to be added, but the proof of concept is there.

So, back to the things that still need to be done. Well, one thing I didn't talk about too much is the fact that you can have this exact same performance profile for Windows, which is very important for driver developers, because differences in rendering will stand out starkly when you compare two instances of this UI running on different platforms. The renders are exactly the same, it's running the same GL calls, and so you can easily find discrepancies in your implementation.

Things that need to be done: there's no tab for looking at the textures. If you're texture bound, having an experiment that will clamp the mipmap level down, so there's not so much texture data going down, is important to see if you've just made textures that are too large.
There's no display of the geometry or the vertices; that's something that I think is of interest to end developers, to try to figure out, okay, maybe there are just so many vertices that I'm stuck at that part of the fixed-function pipeline. The depth buffer is not displayed. Unity specifically asked for overdraw and hotspot visualizations in the render target, where if you've drawn twice to the same pixel in the render target, it'll show up as more expensive, to help them figure out if they've got a problem with their engine. There's a bunch of UI improvements; this is all written in QML, and so you have to do quite a bit of hand tweaking to get the display exactly how you want.

Adding support for more hardware is, I think, the most important thing, which is what I'm working on right now. Another very important thing to enable is Android. There's a whole lot of 3D applications coming to Linux platforms in the Android Play Store, and none of those can be analyzed for your driver or for your hardware. So we need to get API Trace working on Android, so that we can capture the traces and then analyze them in this way on similar hardware.

I've had a little bit of help from some folks I've mentioned before. Lionel has helped me a lot with the performance analysis metrics, and I wish his tool was being demoed at FOSDEM as well, because it's very interesting; so if you find him here, get him to show you what he's done.

One thing: when you take a GL program and you relink it, you need to reattach a whole lot of state from the previous program, and that process can be somewhat intricate. For the workloads I've looked at, I've done it properly, but wherever there are more features in the GL that an application might have used, that's where the path becomes unpaved.
Radeon metrics is what I'm implementing now. Unfortunately, AMD_performance_monitor doesn't expose metrics; it just exports raw counters, and then you need another application to go and compose those counters into usable metrics like the ones we had displayed. That's a key problem I'm trying to fix now. If anyone's interested, there's a whole lot of features that can be worked on independently, and I'd welcome collaborators. Thanks for listening. Any questions?

Yeah, so the reason that this doesn't address Vulkan at all is because there's no tracing support in API Trace, but Vulkan certainly could be addressed with a similar tool. There is a tracing infrastructure that's implemented by LunarG, and RenderDoc has a certain amount of tracing, so there's no reason why the features couldn't be mapped on. I just haven't done that yet, because I'm focusing on the GL workload.

Very cool tool. I was wondering, how do you communicate the batch data and the ISA details of the shader?

Sure. In the i965 driver you can set an environment variable to dump the batch, and you can set an environment variable to dump the SIMD16 assembly, so we just capture that on standard out. The batch data is, like I said, so much that there's a special patch that you apply to Mesa, and recompile it, to let you turn that environment variable on and off just before you begin your render, so that you don't have to pay that penalty for the whole frame.

How did I capture the frame? To get the frame you use API Trace: you say "apitrace trace" with your GL workload, and it serializes every single GL call into a file. Before I started the presentation, I played through the trace up until frame 150, which is the one we were looking at, and stopped. Almost every GL program on Linux is traceable by API Trace, and if one isn't, the developers have then gone and changed API Trace. Yeah, that's what application engineers do all the time. They capture whatever, Grand Theft Auto, and there are actually some teardowns of Grand Theft Auto on Windows where they go through the different renders and show you the techniques. And you could conceivably go and export the vertex data and the texture data; that wouldn't be legal, but you can go and hack away. Okay, thank you.