Hi. Excuse me. Hi. My name is Ian Romanick. I work in the team at Intel that does the open source drivers for all of our GPUs, and I've been working on Mesa and other drivers since 2001. Today I'm going to talk about some recent work we did to reduce peak memory usage in a couple of the compiler passes, the passes in the shading language compiler. So this is basically a talk about optimization.

I'll bring up a couple of the rules of optimization, and I think everyone knows the first rule of optimization: don't. Or maybe: don't, unless you really, really have to. Because at some point things are going to happen in your project. People are going to start using it in different ways, trying to do more things with it, and you'll have to. But any work you do on optimization before you reach that point is going to be speculative, and it will probably be wrong. You'll do the wrong optimizations. In the best case you'll waste your time, and in the worst case you'll be optimizing for the wrong things and making the stuff that's actually important worse.

So we ran into a case where we had to do some optimization recently. Some of our new Gen11 GPUs that are coming out have, on some of the lower power parts, removed support for double precision math in the execution units. But FP64 is still required for OpenGL 4.0 and all the later versions, and we're not just going to drop from OpenGL 4.6 or 4.5 all the way back to 3.2 because there's no double precision. So we're going to implement it in software, just like you used to do in the olden days before CPUs had FPUs. The performance isn't going to be great, but so far we've literally encountered zero applications that use this feature of GLSL. See the first rule of optimization: nobody cares about this feature, so it's okay if it's slow. And that's a big part of the reason why it's not a required feature for Vulkan, because nobody actually wants it.

So late in 2018, work on soft FP64 was getting pretty close to done. Pretty much all the test cases were passing, but the guy who was working on it noticed: huh, there's a handful of these tests where I'll start them running, they'll run for a while, I'll go make a sandwich or something and come back, and the OOM killer has just wrecked my system. Like, what the hell? We tracked it down to a couple of test cases that seemed pretty innocuous.

So we mocked those tests up and ran them. We have this big 80-core server that we use for doing big compiles, some shader-db runs, and some other things. It has 80 cores and, I put in the slides, 128 gigs of RAM, which I think is right. We actually got the test case to run to completion on that, and it peaked at 80% memory usage. So it was, I think, 85 gigs of RAM, which seems really bad.

So we dug around and tried to figure out what was going on, because we looked at the shader that was in this program. In this particular test case, it has a bunch of uniforms of every possible double precision type: the scalar, all the vector types, and all of the square and rectangular matrix types. But then it has to actually use all of those, because the compiler is supposed to dead-code eliminate anything that isn't used. So there's a bunch of math using these. The shader would fit on a single slide, but the compiler right now inlines every single function, so this tiny little shader just explodes into this huge pile of code.
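To give a rough idea of what doing FP64 in software means, here is a minimal sketch in C. This is not Mesa's lowering code, and the helper names are invented; the point is just that each double becomes a 64-bit integer pattern, every operation gets rebuilt from integer instructions, and even trivial helpers have to branch on the special encodings.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative only: a double is 1 sign bit, 11 exponent bits, and
 * 52 mantissa bits.  Soft FP64 works on that bit pattern directly. */
static inline uint64_t fp64_sign(uint64_t bits)     { return bits >> 63; }
static inline uint64_t fp64_exponent(uint64_t bits) { return (bits >> 52) & 0x7ff; }
static inline uint64_t fp64_mantissa(uint64_t bits) { return bits & 0xfffffffffffffull; }

/* Every real operation (add, mul, ...) has to branch on cases like
 * these, which is where the hidden flow control comes from. */
static inline bool fp64_is_nan(uint64_t bits)
{
   return fp64_exponent(bits) == 0x7ff && fp64_mantissa(bits) != 0;
}

static inline bool fp64_is_denorm(uint64_t bits)
{
   return fp64_exponent(bits) == 0 && fp64_mantissa(bits) != 0;
}
```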
The thing that turns out to be the real disaster is that this tiny shader, which starts off with no flow control, ends up with a little over 16,000 basic blocks in it, because each of those functions for doing FP64 operations has to check for things like: should it generate a NaN, did you end up with a denormal that needs to be flushed? There are all these exceptional cases they have to check for, so all of these functions have hidden flow control in them, and that all just explodes out. Now, most shaders that come from real applications aren't usually that big, and even the ones that are big, people have tried to optimize them to not have flow control, because lots of flow control generally performs poorly on GPUs. So we have this extraordinarily huge shader that doesn't look like other huge shaders that applications would normally give us. I guess no one should have been surprised that there were going to be some kind of problems with it.

So the second rule of optimization is: optimize the right thing. I condensed the problem down to a somewhat simpler test case that I ended up submitting for inclusion in Piglit, one that hits the problematic paths without having to rely on the soft FP64 code, because that hadn't landed yet, and that I could run on my laptop without needing 85 gigs of RAM. It's one thing to have a usable test case, it's another thing to just wreck everyone's systems, and it's an entirely different thing if I completely wreck our continuous integration system, because the guy who maintains that will come find me: you added tests that kill everything, why?

For collecting data, I turned to Valgrind's Massif tool. I'm going to expect that most people are at least a little bit familiar with Valgrind: you use it to run your app, and it inserts itself into the paths in your application so that it can collect data. What Massif does is collect data through time about every memory allocation, so that at the end it can show you a timeline of how much memory you were using and when. And I'm actually going to show some of that. Let's see here.

All right. So you run Massif and it collects some data, and then there's another tool called ms_print that you use to actually display the data in a human readable, well, human readable might be an overstatement, human-puzzle-out-able format or something. The important bit here is that the first thing you get is this timeline showing how much memory usage you had through the lifetime of the program, and you can see that right at the beginning there was this huge spike up to around five gigs that then dropped off, and then there were a couple of smaller spikes a little bit later on. But the big peak is the important one.

We can go down and look at the output where it shows where memory got allocated, so you can see who the big consumers are. And right here, 98% of the memory that was allocated, I guess no, 93%, was allocated out of this same function, nir_phi_builder_add_value. For some reason that was hard to say. So that seems like a smoking gun saying: maybe look here. Okay, so NIR is the mid-level IR used in Mesa's shader compiler. It's SSA based, and the phi builder is the data flow analysis pass that inserts the phi nodes in the SSA form of the program.
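As a tiny refresher on what a phi node is, here is a minimal sketch in C; nothing here is NIR code, and the ternary just stands in for the phi.

```c
/* Before SSA, both arms of the if write the same variable, and which
 * write reaches the end depends on the path taken. */
static int before_ssa(int c, int a, int b)
{
   int x;
   if (c)
      x = a;
   else
      x = b;
   return x;
}

/* In SSA form, each write becomes a new value, and a phi node at the
 * join picks between them based on which path was taken. */
static int after_ssa(int c, int a, int b)
{
   int x1 = a;
   int x2 = b;
   int x3 = c ? x1 : x2;   /* x3 = phi(x1, x2) */
   return x3;
}
```

The phi builder's job is to figure out, for every variable, where those joins are and which values feed them.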
So we've got a couple of things that we could spend some time optimizing. We know where all the memory is going, but that doesn't necessarily tell us where to start writing code. And when deciding what to do, it's not as easy as just applying Amdahl's law and working on whatever it points at, because in a real project there are other concerns: risk, the amount of effort involved, schedule issues. All of those things matter.

So we had roughly three places to look. We could optimize the input shader, which in this case is the mangled input shader with all of our soft FP64 stuff inserted into it. We looked at that, and it was really unclear whether adjusting the soft FP64 code would have that much effect on the final memory usage. We could probably adjust a few things and bring it down a bit, but that code is actually fairly complicated and has a lot of twitchy edge cases it has to handle. So it seemed like, even if we could get a good amount of benefit, we might suddenly break things by adjusting that code, because we're not floating point experts. So we didn't want to do that.

Strictly speaking, the reason so much memory is used in the first place is that every single function always gets inlined. So we thought about it: if we stop punching the phi builder in the face by giving it 16,000 basic blocks because we inlined the whole universe, maybe the problem would just go away. But once we get down to a certain point in the compiler stack, functions have never existed beyond that point. Through the whole backend compiler and all of the instruction generation there's no support for functions at all. So that's a huge amount of code we would have to go write, and we needed to ship something. So we decided to take the obvious approach and go work on optimizing the memory usage of the phi builder.

I'm not going to talk very much about the process of putting a program into SSA form and adding phi nodes, because it's complicated and I don't understand it all that well myself. But at a high level, inserting phi nodes involves analyzing each variable in the program and looking at every basic block where that variable might get modified. Then, at the points in the control flow graph where multiple paths of those modifications could come together, you insert a phi node, because eventually those won't be writes to the same variable anymore, they'll be writes to new values, and when you get to that join in the control flow, the phi node picks a value depending on which route you actually took.

Now, you can assign an ordering to the basic blocks in a program, and NIR does this: it assigns an index to each basic block. So the phi builder just says, well, I'll use a simple data structure that I can index with that unique value, which underneath is essentially an array with one entry per basic block, really one entry for the writes in each basic block, indexed by the block's index. This works great when you have normal programs. But when you've got a program with 16,000 basic blocks, now you've got an array of 16,000 pointers for each variable in the program, and something like 15,990 of those pointers are going to be null. So I replaced that simple array with a hash table, and it slashed the memory usage: we went from a little over 5.4 gigabytes to about 1.3.
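Here is a rough sketch of the shape of that change, not the actual Mesa code; the structure and function names are made up for illustration, and the real fix would use Mesa's existing hash table helpers rather than hand-rolling one.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct ssa_def;   /* stand-in for whatever the per-block value is */

/* Before: one slot per basic block, indexed by block index.  With
 * 16,000 blocks that's 16,000 pointers per tracked variable, and
 * nearly all of them stay NULL. */
struct dense_defs {
   struct ssa_def **defs;   /* calloc(num_blocks, sizeof(*defs)) */
};

/* After: a small open-addressing hash table keyed by block index,
 * sized by how many blocks actually write the variable.
 * Start from a zero-initialized struct sparse_defs. */
struct block_def { uint32_t block; struct ssa_def *def; bool used; };
struct sparse_defs { struct block_def *e; uint32_t cap, count; };

static uint32_t hash_block(uint32_t block, uint32_t cap)
{
   return (block * 2654435761u) & (cap - 1);   /* cap is a power of two */
}

static void sparse_insert(struct sparse_defs *m, uint32_t block,
                          struct ssa_def *def)
{
   /* Keep the load factor under 75%, doubling the table as needed. */
   if ((m->count + 1) * 4 > m->cap * 3) {
      struct block_def *old = m->e;
      uint32_t old_cap = m->cap;
      m->cap = m->cap ? m->cap * 2 : 8;
      m->count = 0;
      m->e = calloc(m->cap, sizeof(*m->e));
      for (uint32_t i = 0; i < old_cap; i++)
         if (old[i].used)
            sparse_insert(m, old[i].block, old[i].def);
      free(old);
   }

   uint32_t i = hash_block(block, m->cap);
   while (m->e[i].used && m->e[i].block != block)
      i = (i + 1) & (m->cap - 1);
   if (!m->e[i].used)
      m->count++;
   m->e[i] = (struct block_def){ .block = block, .def = def, .used = true };
}

static struct ssa_def *sparse_search(const struct sparse_defs *m, uint32_t block)
{
   if (m->cap == 0)
      return NULL;
   for (uint32_t i = hash_block(block, m->cap); m->e[i].used;
        i = (i + 1) & (m->cap - 1)) {
      if (m->e[i].block == block)
         return m->e[i].def;
   }
   return NULL;
}
```

The storage per variable is now proportional to the number of blocks that actually write it rather than the total number of blocks, which is exactly the difference between a handful of entries and 16,000 mostly-null pointers.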
I was going to show the Massif output after the change, but I'm running a little bit low on time. I might come back to that. The cool thing that shows up in the after Massif output is that the first peak is basically gone, and that part of the program is no longer the long pole. There are a couple of other peaks later on, but the phi builder is not the critical memory usage path anymore.

At this point we had cut the memory usage enough that we probably could have stopped, but I continued on to look for some more low hanging fruit. We have a fairly complex system in Mesa, throughout the compiler stack, that is sort of a self implemented mark and sweep garbage collector. It's not exactly that, but it's roughly analogous, so we won't ever actually leak memory, but we can have transient leaks. Along with a program, NIR will track a bunch of metadata that gets used by optimization passes. One of the things it tracks is live ranges for a value: during what range of the program does this value need to exist because it might be read. Various passes will change the shape of the program and will invalidate that data. If you have live range data and you delete a read of a variable because you can optimize out an instruction, well, now the live range changes, so the live range data becomes invalid. A bunch of the optimization passes would mark this piece of metadata as invalid, but then the data would continue to exist. By digging down pretty far into the Massif data, I kept noticing that some of the metadata showed up at points in compilation where we were never going to need that metadata again. It turned out it still existed there because it got marked as invalid and then no one released the memory. So the change, which literally added four lines of code, cut another third of a gigabyte out of the peak memory usage for the worst case shader.

Then, just to continue looking at low hanging fruit, I did a couple of micro-optimizations using pahole, or pa-hole, or however you want to pronounce it. It's kind of a cool program: you run it on your object files, and it will analyze your structures and tell you exactly how the compiler actually laid each structure out. So it will tell you where there are holes in your data structures. For example here, after the type field of the structure, the compiler inserted four bytes of padding to get proper alignment of the pointer that follows. That's just dead space, and if you rearrange the structure a little bit you won't have any of those dead holes. It's definitely a micro-optimization, but in this case it ended up being pretty useful. I didn't have any data about it, which I should have collected, but nir_instr is the base of every single instruction in the IR, so if you've got a thousand instructions in your shader, you've got a thousand of these. Having those four bytes just wasted everywhere adds up to death by a thousand really tiny cuts.

So what I did in this case is I marked the nir_instr_type enum as packed, so that instead of taking up a full int it would only be a byte, and I shuffled some things around. I think I moved the block pointer up to right after the exec_node, and then basically sorted all of the fields by the size of the underlying type, and that eliminated all the padding. I think it cut, what did it do, eight bytes off the size of the structure? I want to say four or eight, because of the padding. And it adds up.
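Here's a simplified illustration of the kind of hole pahole reports and how packing the enum plus reordering removes it; these are made-up field names, not the real nir_instr layout, and the packed-enum attribute is a GCC/Clang extension.

```c
#include <stdint.h>

struct block;   /* only used through a pointer here */

enum instr_type { INSTR_ALU, INSTR_PHI, INSTR_JUMP };

/* On a typical 64-bit target the enum takes 4 bytes, and the compiler
 * inserts 4 bytes of padding so the pointer that follows is 8-byte
 * aligned: pahole reports that as a hole. */
struct instr_before {
   enum instr_type type;    /* 4 bytes */
                            /* 4-byte hole */
   struct block *block;     /* 8 bytes */
   uint8_t pass_flags;      /* 1 byte + 7 bytes of tail padding */
};                          /* sizeof == 24 */

/* Packing the enum down to a byte and sorting fields by the size of
 * their underlying type eliminates the internal padding. */
enum instr_type_packed {
   INSTR2_ALU, INSTR2_PHI, INSTR2_JUMP
} __attribute__((packed));

struct instr_after {
   struct block *block;           /* 8 bytes */
   enum instr_type_packed type;   /* 1 byte */
   uint8_t pass_flags;            /* 1 byte + 6 bytes of tail padding */
};                                /* sizeof == 16 */
```

A few bytes is nothing for one instruction, but multiplied across every instruction in a 16,000-block shader it turns into real memory.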
So, possible future work: there are still a bunch of places where we use way too much memory. We could implement real functions, and I suspect we're going to have to, but that's going to be a huge amount of work. There's also another data flow analysis pass way, way down in the backend, operating at the machine code level, that does a very textbook implementation of the data flow analysis algorithm. So it has these huge bit vectors with one bit per variable in the program, and then you have a copy of each bit vector for each basic block in the program. There are other algorithms you can use for that which don't need these massive bit vectors. I had looked at this a little bit: because most variables won't be live or accessed in most basic blocks, large stretches of those vectors are either all zeros or all ones. So I tried using a sparse data structure to track that more compactly, and it shaved off about 7% of the memory usage, but since it was a more complex data structure instead of just a simple big flat array, the 7% memory savings came with a 7% run time cost. That wasn't the trade-off I wanted. But if we basically chuck that algorithm and replace it with something that's not just from Compilers 101, we wouldn't take the performance penalty and it would use a lot less memory.

How many minutes do I have left? Two? Okay, all right. So here's my pointer. So then here's the after graph. The first spike is completely gone and there are just a couple of spikes later on. The peak spike is actually during register allocation. There's a shared component in Mesa for a graph coloring register allocator, and it has some big data structures in it too, especially when you've got lots and lots and lots of basic blocks. I haven't analyzed that code to see whether it can be helped very much, because it's really complicated code and I'd rather not go in there if I don't have to. But okay, yeah. That was what I had. Any questions? I'll start here.

In retrospect, would you consider asking the hardware guys to re-enable FP64? No, because I'd rather have that chip space used for stuff that actually makes programs people care about go faster. And especially since those are the really low power parts, they're mostly targeting OpenGL ES, which doesn't have FP64. So it's frustrating that we had to do all this work, and then had to do a bunch of other stuff to make the work actually work, but I think it was the right choice.

Do I have time for one more question? I have the same question. Oh, okay, all right. Then you get the same answer.