So in the autumn of 2008, I started implementing Mesa's GLSL compiler completely from scratch. We had a GLSL compiler, but for a whole bunch of reasons that I think I've covered in talks at other conferences, it had to go. I'd never written a compiler before; I'd taken a compilers class at university, but that was it. So I did a couple of months of research and started on the project armed with two books. One was Steven Muchnick's seminal work, Advanced Compiler Design and Implementation. Even when I started this project, that book was already around ten years old and starting to feel a little dated. It had a very small section presenting SSA as this new thing that people were starting to play around with. But everything else in it was really detailed explanations of battle-tested algorithms, and all of the algorithms and data structures were related back to actual production compilers that people were shipping, the problems people had with them, and the things they had to do to make stuff actually work. So it was really solid practical advice. In fact, there's a JIT project started by some folks at Red Hat that's using the middle IR from that book as its basis, so people are still using it; it's still a great resource. The other book I had in the other hand was A Retargetable C Compiler, by a couple of university researchers, Christopher Fraser and David Hanson, which is primarily about the lcc compiler. The draw of this book was lburg, a code-generator generator, and BURS, the bottom-up rewrite system they used to implement code generation in that compiler.
So lcc has this tool called lburg that basically implements a domain-specific language for generating code generators. You have a set of rules, each with a pattern of IR to match, a blob of something that looks a bit like assembly code to generate, and an estimate of how much that blob of assembly costs. That set of rules is compiled down into data and code that gets built into the compiler and used by BURS, the bottom-up rewrite system, to actually implement the code generator. BURS makes two passes over the IR trees: a bottom-up pass where it looks at progressively larger trees to figure out the least-cost way to implement each little piece, decorating the tree as it goes up, and then a second, top-down pass to actually emit the code. This has two orthogonal benefits. One is that open-coded pattern-matching code is really tedious to write and super error-prone. For example, on later Intel GPUs we have a flag on the sources of logical operations that says "take the bitwise inverse of this source," so you can fold NOTs into the sources. If you see an IR pattern of NOT of (A AND B), you can just invert A and B and emit an OR instruction instead, so instead of emitting two instructions you emit one. To do that hand-coded in the compiler, you end up with this lovely eye chart: to fit the code implementing this conceptually simple thing on a slide, I had to use a six-point font and gut all the comments, and it's still a disaster. Without the comments, if you saw this code in the compiler, it wouldn't even be obvious what it's doing. Because of that, we have very, very few clever bits of code generation like this in our compiler.
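The two-pass BURS scheme just described, bottom-up least-cost labeling followed by top-down emission, can be sketched in a few lines of Python. Everything here is invented for illustration: the node names, rules, "assembly" templates, and costs are hypothetical, not lcc's actual tables, and only the labeling pass is shown.

```python
# Minimal sketch of BURS-style bottom-up labeling. Rules and costs
# are hypothetical, not lcc's real tables.

RULES = [
    # (pattern, assembly template, cost). 'reg' matches any subtree
    # whose value is computed separately into a register.
    (('AND', 'reg', 'reg'), 'and %d, %a, %b', 1),
    (('OR',  'reg', 'reg'), 'or  %d, %a, %b', 1),
    (('NOT', 'reg'),        'not %d, %a',     1),
    # One rule covering a larger tree: NOT(AND(a, b)) as a single OR
    # with both sources inverted -- the trick from the talk.
    (('NOT', ('AND', 'reg', 'reg')), 'or %d, -%a, -%b', 1),
]

def matches(pat, tree):
    if pat == 'reg':
        return True
    return (isinstance(tree, tuple) and pat[0] == tree[0] and
            all(matches(p, t) for p, t in zip(pat[1:], tree[1:])))

def covered_cost(pat, tree, costs):
    """Cost of the subtrees a pattern's 'reg' leaves still require."""
    if pat == 'reg':
        return costs[id(tree)]
    return sum(covered_cost(p, t, costs)
               for p, t in zip(pat[1:], tree[1:]))

def label(tree, costs):
    """Bottom-up pass: record the cheapest covering of every node.
    (A real BURS then makes a top-down pass to emit the code.)"""
    if not isinstance(tree, tuple):      # leaf: already in a register
        costs[id(tree)] = 0
        return 0
    for sub in tree[1:]:
        label(sub, costs)
    best = min(cost + covered_cost(pat, tree, costs)
               for pat, asm, cost in RULES if matches(pat, tree))
    costs[id(tree)] = best
    return best

# NOT(AND(x, y)): two instructions via the small rules, but the
# combined rule covers the whole tree for a cost of one.
print(label(('NOT', ('AND', 'x', 'y')), {}))   # 1
```

The point of the bottom-up order is visible in `label`: every subtree already has a best cost before its parent considers rules that swallow it whole.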
When I implemented this NOT-folding optimization, I think it helped a few thousand shaders in our shader database by a percent or two. When you have things that are hard to implement and hard to maintain like this, they have to have a lot of impact or there's just no return on the investment. So we don't have a lot of that, but we could. The other thing that attracted us to some sort of automatic code generation is being able to annotate how much you think a particular instruction sequence is going to cost, so that you can take a bit more of a global view and produce optimal code output. This is definitely orthogonal to the first benefit, and some other work has been done on it that doesn't use a code-generator generator. Eric Anholt has, I think, one of the oldest still-open MRs in Mesa's GitLab (since we switched over to GitLab) for a thing called NOLTIS in the driver he was working on that tries to apply some of this. So it's orthogonal, but the two things play together really well: if you're going to do this sort of global optimization, it helps to have a lot of different patterns so the optimizer can try different routes through emitting a large expression tree. If you have a bunch of different ways of implementing something, that gives the optimizer more freedom to pick better options. But once we started digging into this years ago, the dream quickly turned into a nightmare. There were so many practical and technical problems. The first big one was that we'd have to implement our own domain-specific language for the code-generator generator. In order to implement our compiler, we'd have to implement another compiler, and that was a huge problem in Mesa because, until very recently, we didn't have a way to build something during the Mesa build that would run on your build system to generate other build dependencies, just because of the disaster that was Mesa's build system.
So it took almost 12 years before we could even have done this, and that ended up being a problem for some other things too, like how we wanted to implement the shading language built-in functions, but that's another story. It also turns out that source modifiers are a big deal on GPUs. We can, almost for free, say that a source of an instruction should be negated, or have its absolute value taken, or both. That's not really a thing on CPUs, so none of the research or published work on implementing a code-generator generator even considers these things, and it turns out they're kind of a huge hassle. But they're super important for generating optimal code on GPUs. Most of the time, especially on the GPUs we cared about when we started this project, things like negation are free, but not always: on i965 GPUs, for example, trig functions and some other complex math can't have any kind of source modifiers. So there's a lot to take into consideration about whether you have to resolve those source modifiers. And it wasn't even universal which things were free. On i965, absolute value was free, but on i915, which was a thing we still cared about 12 years ago, and on R200 and R300, to kind of date when this started, you had to use separate instructions for absolute value; they didn't have an absolute-value source modifier. And if we were going to have a bunch of patterns covering all the different possible instructions, we didn't want to go through and decorate every single place with the combinatorial explosion of "yes, you can have absolute value and negation on both sources of an add" and so on. It would have made everything way too verbose and kind of defeated the purpose.
Going along with source modifiers, there's also the possibility of destination modifiers, which definitely don't exist on general-purpose CPUs. You can have a saturate modifier that clamps the result of an operation to the range [0, 1], and on some NVIDIA GPUs there's a different saturate modifier that clamps to [-1, 1]. We've come a long way in the compiler stack in Mesa since then, and it's kind of dubious whether having the destination modifier as a special thing in the IR is actually that helpful. I think in the next year we'll probably end up removing it, but it's still something that has to be considered, because clamping happens a lot and being able to get it for free on instructions is a big deal. Then, on the GPU architectures of the day, which were predominantly vector-based for at least part of their pipeline, swizzling was hugely important. That's not a thing that even exists on SIMD architectures: SIMD architectures generally just do the same thing on all the channels, in order, with no reordering of channels except via a few special instructions dedicated to reordering. Swizzles end up being worse than source modifiers, because with source modifiers there are only four combinations to worry about: none, negate, abs, and negative abs. With swizzles on a vec4 architecture, where four-element vectors are your base type, there are either 256 or 1296 possible combinations of swizzles. Some architectures can not only reorder XYZW arbitrarily per component but can also select the constant values 0 and 1. And that doesn't even consider having to deal with write masks. The whole thing was so far out into basically doing compiler research to figure out how to handle these things that it was just completely impractical.
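The two swizzle counts quoted above fall straight out of counting: each of the four output components independently selects an input, so the total is the number of choices raised to the fourth power.

```python
# Why "256 or 1296": each of a vec4's four output components
# independently selects one input, so the count is choices ** 4.
plain = 4 ** 4        # each component picks one of x, y, z, w
with_consts = 6 ** 4  # ...or additionally the constants 0 and 1
print(plain, with_consts)   # 256 1296
```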
But that dream of being able to write these concise descriptions of "when you see this, generate this code" was so compelling that we tried several times. Both Eric Anholt and I each tried at least twice to come up with something that would do this, but every time we ran into a roadblock and gave up, because we had to ship products rather than do compiler research, as much fun as that might be. They say time heals all wounds. I don't know if that's true, but I do know that if you ignore technical problems long enough, they will probably become irrelevant. So what wounds did 10 years heal? Well, first of all, we don't have to write a DSL anymore. Pretty much everything that lburg, or any of the other code-generator generators people have come up with, needed from its language, we can handle with just Python data. We can make data structures that encapsulate the "match this, do that" really easily with plain Python data, and in fact nir_opt_algebraic, which is already part of Mesa, is an existence proof that that kind of architecture works really well. There are a couple of things nir_opt_algebraic doesn't support that might be nice in a generator generator. It doesn't support any kind of regular expressions or more complex pattern matching; you just statically match this exact pattern and do this exact thing, and the "do that" part is very limited in what you can put in it. It's also a bit lacking in being able to put constraints on the sources. Over the years we've added a lot of constraint matching on parts of the patterns, but it can't do something like: if you have a match expression with, say, two constants in it, match this pattern only if one constant is less than the other. There's no way to express that.
And there are a bunch of cases where that would be handy. Also, vector architectures are dead. All that swizzling and write-masking nonsense is irrelevant now because those architectures died off. The last Intel architecture where we had to use the vector-oriented mode was Haswell, which was years ago at this point; Broadwell supported it but didn't require it, and we never used it in the driver we shipped. There are still a few architectures with more traditional SIMD or SIMD-within-a-register features, like the handful of 4x8 instructions on VC4 that operate on four parallel bytes within a single 32-bit register. But because those look like traditional SIMD, they don't have the craziness of swizzling and write masking, so we don't have to worry about them either; they're not a problem. So, looking at opt_algebraic as a DSL: the first field is a simple pattern to match, which can be an arbitrarily nested tree. The next field is a replacement, also a tree. Each of them can have variables, which are just names; we have some convenience names for them, a, b, c, d, and so on, so you can rewrite the expressions. Then there's an optional, sort of global, condition flag. In this case, the rule does lowering: the IR has a subtract instruction, but most GPUs don't have one; you just do an add of a negation, and this implements that lowering pass. The "do that" part of the nir_opt_algebraic DSL is always "replace the old tree with this new tree." Replacement is a little bit of a misnomer, though, because nothing actually gets removed from the input: it just adds some more stuff. You get a new result value, and all the old users of the old result get modified to use the new result.
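The subtract-lowering rule just described is plain Python data in nir_opt_algebraic's style: a search tree, a replacement tree, and variables that are just names. The entry below mirrors that shape (the exact spelling in Mesa's nir_opt_algebraic.py may differ), and the tiny matcher and substituter are my own illustrative sketch, not Mesa code.

```python
# A nir_opt_algebraic-style entry: (search pattern, replacement).
# Opcode names follow NIR's fsub/fadd/fneg; the matcher below is a
# sketch for illustration, not Mesa's implementation.
lower_fsub = (('fsub', 'a', 'b'), ('fadd', 'a', ('fneg', 'b')))

def match(pat, tree, env):
    if isinstance(pat, str):            # a variable: bind or check
        return env.setdefault(pat, tree) == tree
    return (isinstance(tree, tuple) and pat[0] == tree[0] and
            all(match(p, t, env) for p, t in zip(pat[1:], tree[1:])))

def subst(pat, env):
    if isinstance(pat, str):
        return env[pat]
    return (pat[0],) + tuple(subst(p, env) for p in pat[1:])

def rewrite(tree, rule):
    """Build the replacement tree. As in the real pass, nothing is
    deleted here: users of the old value get repointed, and dead-code
    elimination cleans up the leftovers later."""
    env = {}
    if match(rule[0], tree, env):
        return subst(rule[1], env)
    return tree

print(rewrite(('fsub', 'x', 'y'), lower_fsub))
# ('fadd', 'x', ('fneg', 'y'))
```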
We then expect that later optimization passes, like dead-code elimination, will clear out any bits that are no longer needed; they do the actual deleting. The other subtle point here is that parts of the input might not match any pattern in your giant pattern file, which is actually a good thing: eventually, as you go through the optimization passes, you get to a point where nothing matches. Otherwise you'd get stuck in an infinite optimization loop, which is a bug that happens occasionally when you have a series of patterns that fight with each other. If you had this pattern and another pattern that converted add-of-neg back to subtract, and they weren't decorated with conditions, they would just keep undoing each other. Each pass through, both would activate and report that optimizations happened, so the loop keeps going and you're stuck forever. It's a bug that happens from time to time. All right, so especially after hearing me talk up nir_opt_algebraic, it's probably not hard to guess where this is going: we can use this existing infrastructure to implement a new "match this, do that" kind of DSL. But there are still some sticking points. The algebraic optimizations happen while the IR is still in a very pure SSA form, but when we do code generation, we've come out of SSA form, so a lot of the assumptions made by the opt_algebraic framework no longer hold. For example, opt_algebraic can't handle source modifiers in the matching: if it encounters a source modifier at any point while trying to match a pattern against the actual IR, it just says "I don't know" and gives up. What seems to be sufficient for code-generation purposes is, when you encounter a value and that same value with some source modifiers, to just treat them as completely unrelated variables.
It's not a value and its negation; it's a value and a different value, and you just match the patterns as though they had nothing to do with each other. opt_algebraic also can't handle the destination saturate modifier. So far, I think it's sufficient to only allow the saturate modifier at the root of the match tree. The subtract example isn't a great illustration because there isn't much of a tree there, but the root of the tree could have a saturate modifier, and if you encountered one farther down in the tree during the match, you'd say it doesn't match. So far I haven't found any cases during code generation where it would actually be beneficial to detect saturate modifiers at other points in the tree. In the couple of cases I came across where maybe something clever could be done, it turned out to be better to just implement an algebraic optimization that changes the shape of the tree earlier, before the IR is in the state where saturate is a modifier rather than an instruction. Do those optimizations earlier on, and later on you don't see that stuff during code generation, and it ends up being much simpler. That may change as we come up with more clever code-generation patterns, and if it gets to that point, we may have to stop using destination modifiers entirely, which we may end up doing anyway; it's still an open question. opt_algebraic also can't handle non-SSA destinations. After we've come out of SSA form, if the shader has control flow, there are going to be places where writes happen to registers instead of to SSA values. That's fine: it's completely safe to match the final operation as writing to a non-SSA value. The patch I referenced there just deletes an assertion from the matching code, so that's pretty easy. But going along with that, it also can't handle sources that are non-SSA values.
And we have to be able to match those, because we have to be able to generate code from them. So far, I think it's safe to match a non-SSA value in an expression tree if that non-SSA value appears only once. The problem is that SSA values, by definition, don't change, but with non-SSA values, two reads that are separated at all could be reading different versions of that value, so they might be different values. If the same non-SSA value appeared multiple times in an expression tree, each occurrence might be different, the tree doesn't really match, and you might generate bad code. For example, say you had a pattern that matched A + A and replaced it with A * 2. If those two As were non-SSA values, they might be different values, so replacing them with one of them multiplied by two would give you an incorrect result. So I think it's safe, and I have some results later on that help convince me of this, that if a non-SSA value appears only once, you can just match it and go on with life. The need for non-SSA sources in the trees poses another problem for the existing matching engine, but I'll get to that in a couple of slides. The existing infrastructure also has really limited type tracking. Values themselves in NIR don't have types; they're just a blob of bits. A value only has a type when it's used by an instruction, and the instruction dictates what the type is. We have separate floating-point add and integer add instructions, so if the same value is a source to both, in one case it's interpreted as floating point and in the other as integer, even though it's the exact same sequence of bits. Where this becomes a problem is, for example, that on Intel GPUs we don't have a specific instruction to convert integer to float.
What we have is a move instruction, and you mark on the move that its source is integer and its destination is float. So when you get to the "do that" part of the patterns, you need to know: I'm just emitting a move instruction, but for this source, what type was it being interpreted as in the pattern that got matched? So some additional tracking needs to happen so that the values used in the "do that" part can be connected back to where they appeared in the original match tree. Like I mentioned before, opt_algebraic doesn't require that everything in the input match something in the patterns, but if you're generating code, everything that exists in the IR had better end up as some kind of instruction in the output, or you're just dropping stuff on the floor. It isn't a problem per se, but it's something you have to be aware of when writing the machine description and the rest of the code-generator infrastructure, so you can detect: I sent some IR into the matching engine and it didn't match anything. You need to generate some kind of error or do something, because you're about to do something bad. The other quirk, again not really a problem but something to be aware of, is that in opt_algebraic the input language and the output language are the same, whereas the whole point of code generation is translating from one thing to a different thing. What this means in a practical sense is that about half of the infrastructure in opt_algebraic is completely unusable for a code generator. The only part that's actually reusable is the matching part; the "we found a match, now let's do something" part isn't any good to you, because we're doing something different.
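Coming back to the non-SSA sources from a couple of paragraphs up, the "only match it if it appears once" rule is easy to state as a check in the variable-binding step of a matcher. This is a sketch of the reasoning, not Mesa code, and the value representation here, ('ssa', n) versus ('reg', n), is invented for illustration.

```python
# Sketch of the safety rule for non-SSA sources: a pattern variable
# may be bound once by anything, but a *repeated* occurrence is only
# trusted for SSA values, since two reads of a register can observe
# different versions. The value tuples are hypothetical.
def bind(var, value, env):
    if var not in env:
        env[var] = value
        return True
    # Second occurrence: "A + A -> A * 2" is only valid when A is SSA.
    return env[var] == value and value[0] == 'ssa'

env = {}
assert bind('a', ('ssa', 5), env) and bind('a', ('ssa', 5), env)
env = {}
assert bind('a', ('reg', 2), env)
assert not bind('a', ('reg', 2), env)   # might be a different version
```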
And it turns out that about 45% of the work of this project was completely rewriting the infrastructure for the "do that" part of the process. The other difficulty with non-SSA values that I alluded to before has to do with the optimized matching engine in opt_algebraic. The algebraic optimizations happen a huge number of times during the main optimization loop, and we found that a big chunk of the time was spent just doing the matching. So Connor Abbott wrote a really optimized matching engine that uses a modified finite-state machine to do the matching very quickly, but the way it's implemented uses the SSA value numbers, the index on each SSA value, as the main key for the entire engine. If you have to match things that don't have SSA value indices, you're stuck; the whole infrastructure falls apart. So we can't use the super-optimized matching engine. That's okay, because while algebraic optimizations might happen hundreds of times while compiling a shader, code generation happens once. If we have to use a slower path for something that happens once, I don't expect it to be a big deal. Okay, so, the output language. There were a few goals and requirements as I was implementing this. The first was that I really wanted the thing a human authors to look as much like the actual GPU assembly as possible. It's implemented as Python objects, but I wanted it to look like GPU assembly when you read it, because the people who are going to author, debug, and maintain these things are going to be familiar with that. That's the output you see when debugging shader compilation issues, it's the format in the hardware documentation, and it's what we get when we look at output from shader-db runs. That's the language we work with.
So let's make this other thing look like that, just to reduce the cognitive burden of understanding these patterns. This can be challenging, because a lot of the instructions, the way register accesses are encoded, and a bunch of other things about these GPU architectures are weird and complicated, and getting them to look civilized as a bunch of Python objects is tricky. The other thing is that you really want to be able to represent as many of the weird, quirky bits of the architecture as possible. You don't have to capture every quirk, but it turns out that those weird corners of how a particular GPU works are exactly where you're most likely to find opportunities to generate smarter code: the ways you can access subparts of registers, or mix types in instructions, or do other kinds of weird things. That's where you can generate clever stuff, so it's beneficial if you can just represent it. As I said before, in Python you have to be able to represent these things as a list or tree or some kind of collection of plain Python objects; the whole point of not implementing our own DSL is that we can just use Python. And in C, which this will somehow generate, we want it represented as data or code or some combination thereof. Ideally, I think, we'd generate a set of compact data tables to represent the transformations and the patterns. With the existing hand-coded code generator, a single generator covers all the different GPU generations, and that's very practical in that case. But when you have a machine-description file that describes the translation of IR patterns to assembly instructions instead of a hand-crafted code generator, it's more natural to have a bunch of separate files that each cover a single architecture.
And the thing that naturally flows from that is to compile each machine description into its own blob of data. Instead of one big chunk of C code, you have a bunch of pieces of data, so having those five or six or seven piles of data be as compact as possible is beneficial. Here's how the output language looks as a list of Python objects. This implements the fsign instruction for 64-bit floats. I picked this one because it hits almost everything the output language can support. The first thing it does is declare a temporary register, because this sequence of instructions generates a temporary value. Then it has an instruction that loads that temporary register with the double-precision float immediate zero. It does a comparison that sets the not-zero conditional modifier flag. It does another move. It does a bitwise AND instruction that only uses the upper 32 bits of the 64-bit destination and the upper 32 bits of the 64-bit source. And then it has an OR instruction that is predicated, so it only executes if the compare instruction actually set the not-zero flag. To pretty much anyone familiar with the assembly language of this GPU, what this is doing looks pretty obvious. If you don't know this GPU's assembly language, it's just a pile of gobbledygook, but then you're probably also not authoring or maintaining one of these machine-description files. It covers pretty much everything, and it does it in a compact, concise way, and it's just a bunch of Python objects: the instruction constructor creates an object, and then the predicate method modifies that object. It does things in a pretty clean way, I think. What I ended up doing is that when you run the Python scripts to convert the machine description into C, they generate a couple of tables of bytecode.
There are two separate tables. One is a table of immediate values, because those can be up to 64 bits each, and the other is a sequence of instructions, where each instruction is much smaller; I don't remember if I managed to pack each one into 16 bits or 32 bits. Each instruction in the bytecode does a very simple thing: one declares an opcode, a separate one declares the destination with all of the flags and modifiers a destination can have, and another does the same for each of the sources. The C code that then converts that bytecode into actual machine code ended up being only 203 lines, including the license header, so maybe 150 lines of actual C code. It's very small and very concise. I ended up creating six different machine-description files to cover all the various generations of Intel's GPU architectures, and the files are organized so that it's really easy to just diff one generation's machine description against the next and get a very compact output of "here's all the stuff that changed from Gen 7 to Gen 8." It also makes it really easy to add the next generation: you just copy the old one and go to town making edits, adding new features, or removing things that use features or instructions that went away. Once I got the first machine description written, cranking out the rest was pretty easy; the first one is always the most work. So, remember the eye chart from a whole bunch of slides back? We can now look at what it takes to do all of that same optimization using the machine description. It's still a lot of stuff, but nowhere near as much: I fit it on a slide in a 10-point font instead of 6-point, and I didn't have to delete any of the comments. But I want to look at just one of the sub-patterns here.
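The two-table bytecode from a moment ago can be sketched roughly as follows. The word kinds, field widths, and value encodings here are all invented, since the talk doesn't specify the real layout; the `decode` function stands in for the small C interpreter that walks the bytecode.

```python
# Hedged sketch of the two-table output: wide immediates go in a side
# table, and each emitted instruction becomes a few 16-bit bytecode
# words (opcode, then destination, then one word per source). Every
# field layout here is hypothetical.
import struct

OP, DST, SRC, IMM = range(4)            # word kinds (2 bits)

def encode(insns):
    imms, words = [], []
    for opcode, dst, srcs in insns:
        words.append((OP, opcode))
        words.append((DST, dst))
        for kind, val in srcs:
            if kind == 'imm':           # too wide: store an index
                words.append((IMM, len(imms)))
                imms.append(val)
            else:
                words.append((SRC, val))
    packed = b''.join(struct.pack('<H', (k << 14) | (v & 0x3FFF))
                      for k, v in words)
    return packed, imms

def decode(packed, imms):
    """Stand-in for the ~150 lines of C that walk the bytecode."""
    out = []
    for (w,) in struct.iter_unpack('<H', packed):
        kind, val = w >> 14, w & 0x3FFF
        if kind == OP:
            out.append([val, None, []])
        elif kind == DST:
            out[-1][1] = val
        elif kind == SRC:
            out[-1][2].append(('reg', val))
        else:                           # IMM: fetch from the side table
            out[-1][2].append(('imm', imms[val]))
    return [(op, dst, tuple(srcs)) for op, dst, srcs in out]

prog = [(7, 0, (('reg', 2), ('imm', 0x1234567890))),
        (3, 1, (('reg', 0),))]
packed, imms = encode(prog)
assert decode(packed, imms) == prog     # round-trips losslessly
```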
This sub-pattern says: if we have a NOT of a logical instruction whose sources don't have any source modifiers, we're going to emit a single instruction that's the other logical instruction. So if we had an AND, we emit an OR with all the same sources, but with the negate flag set; on this particular architecture, the negate source modifier on a logical instruction instead performs a logical NOT. All the rest of this handles one sort of annoying quirk that came out of this: all of the patterns have to match for specific bit sizes, so there are a lot of redundant patterns in the machine descriptions. In this case, I have to generate separate patterns for 8, 16, 32, and 64 bits, because I had to know what types to decorate the sources with. It was a bit of a hassle, and that's why this ends up a little bigger than it should have to be. There may be a solution to that; I don't know. It didn't affect that many of the instructions I wanted to generate, so it wasn't too bad, but there's always potentially room for improvement. So: we're talking about replacing a pretty major subsystem in the compiler, and whenever you replace a big subsystem with something new, there's one important goal you really should have in mind that people don't talk about a lot, and that's minimizing disruption. If you're going to replace some existing code that works with a shiny new version that makes some things faster, you'd better make sure it doesn't substantially hurt other things. And if some things are hurt, you have to at least make sure the damage is minimal and the pros dramatically outweigh the cons.
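The single-instruction rewrite in that sub-pattern, a NOT of an AND becoming an OR with both sources negated, is just De Morgan's law, which a quick truth-table check confirms:

```python
# De Morgan's laws: the identity behind folding a NOT into the
# "other" logical instruction with negated sources.
for a in (0, 1):
    for b in (0, 1):
        assert (not (a and b)) == ((not a) or (not b))   # NOT(AND) -> OR
        assert (not (a or b)) == ((not a) and (not b))   # NOT(OR) -> AND

# The bitwise form holds too, which is what the GPU actually computes:
x, y = 0b1100, 0b1010
assert ~(x & y) == (~x | ~y)
assert ~(x | y) == (~x & ~y)
```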
When I ran the machine descriptions just as I authored them, without adding any new optimizations, across our shader database, which has about 50,000 shaders from real applications, there was no change in the generated code. That felt pretty cool. The impact on the Mesa source was about 1,400 lines of deletions and 4,800 insertions, and a huge chunk of that was just the machine-description files, because there were six of them and each averaged about 500 lines. There's some debate about whether we should extract the common bits from the machine-description files and store them in a separate file that basically just gets included. I think it's better to keep them standalone, where each one is entirely independent, because as the architectures change, less and less is truly shared across a large chunk of the architectures, so extraction has diminishing returns as time goes on. But I don't know; it's still a debate to be had. Then, toward the end of the project, I added a couple of really simple optimizations that would have been a hassle to add to the hand-coded code generator, and those helped a few hundred shaders by a couple percent. At this point, almost no work has been done on finding new optimization patterns and clever tricks, but I have a big list of things I want to try, and from having looked at lots of shader assembly over the years, I know there's a lot of stuff we can do. So, future work. Well, wait, no: finishing current work. There are still a couple of patches marked work-in-progress. The big one, like I mentioned, is deciding what to do about the machine-description files: keep them standalone or extract the common parts. There are a couple of patches in the MR that show what some of that would look like if we pulled out the common parts.
The other thing is there are still a couple of patches that I think could really benefit from some unit tests, in particular the one that enables non-SSA sources. I think it's safe, but I would like to spend some time trying to come up with some counterexamples and encode those as unit tests. I think I'm running out of time, so I'm going to try to go through the actual future work kind of quickly. So one thing: some instructions can't have source modifiers, some instructions can't have immediate values as sources, and there are a handful of other limitations like that which are currently handled by adding a bunch of extra patterns and a bunch of complex stuff. For example, on the Intel GPUs, where the negation source modifier on a logical instruction means take the logical NOT, if a source has the negation source modifier on it, we have to emit an extra move instruction to resolve that negation. I think it would be better to encode that a different way: have a set of patterns that say if you have an immediate value, you can convert it into something else with a move, or if you have source modifiers, you can eliminate them by adding a move; then simply decorate some of the patterns as "can't be an immediate" or "can't have a source modifier," and have the pattern-matching engine first match the resolve pattern and then match the other pattern. I think that would make things a lot cleaner and would help support some of the other future work that needs to be done. So: integrate code generation with CSE. As you're emitting instructions, do the common subexpression elimination at the same time, so you don't re-emit the same instruction in each basic block. Right now we have a separate CSE pass that will just go and clean that up, but we could integrate it directly with the code generation.
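A rough sketch of that "resolve pattern" idea, under the assumption that sources are either plain names, immediates, or modifier-prefixed names: if a pattern is decorated as not accepting an immediate or a source modifier, the offending source is first routed through a move and the pattern then matches the move's destination. Every name here is hypothetical, not Mesa's actual API.

```python
def resolve_source(src, no_immediate=False, no_src_mod=False, emitted=None):
    """Return a source that satisfies a pattern's restrictions, emitting a
    'mov' into `emitted` when the original source isn't allowed directly.

    Toy encoding: an int is an immediate, a leading '-' is a source modifier.
    """
    if emitted is None:
        emitted = []
    is_imm = isinstance(src, int)
    has_mod = isinstance(src, str) and src.startswith("-")
    if (no_immediate and is_imm) or (no_src_mod and has_mod):
        tmp = f"t{len(emitted)}"
        emitted.append(f"mov {tmp}, {src}")  # resolve via an extra move
        return tmp, emitted
    return src, emitted                      # already acceptable as-is
```

For example, `resolve_source(5, no_immediate=True)` would hand back a temporary plus the `mov` that materializes the constant, and the restricted pattern then matches against the temporary.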
That by itself doesn't buy you very much, but it does make it much easier to do dynamic cost estimation. If you have a sequence that's going to match, and you know what the output pattern is, you want to be able to estimate what the cost is going to be. If you know that you don't need to emit these three instructions, because they already exist in this basic block and you can just reuse those values, that's going to give you a completely different cost estimate than if you had to emit all the instructions in the block. The reason I specifically say dynamic cost estimation is that a lot of the existing research about code-generator generators uses static cost estimates, but GPUs are complicated. It's really hard: if you have more than about two instructions in your replacement pattern, it's hard, without any context, to estimate how much that's actually going to cost. So the cost estimation function has to be dynamic, or it's just lies, and at that point, why are you doing it? And then finally we get to great success: once we have cost estimates, we can integrate with something like NOLTIS, or some other BURS or existing optimization scheme that tries to pick the best sequence of instructions for a given set of expressions. Then I've got a bunch of links to all the MRs and the work at Red Hat that I mentioned. There's an extra MR in there that contains some prerequisite patches for the actual code generator MR. And then Eric's very old NOLTIS patch series. And I think I finished within time. Yay. Any questions? Yes. Could you architect it in such a way that, instead of having common code, you instead have a file for an older version, and then a newer file where only the things that have changed in that version appear, and that overrides what's in the old version? Right.
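A toy illustration of why a CSE-aware cost estimate differs from a static one: instructions the basic block already contains cost nothing to "emit" again, so the same replacement pattern can score very differently depending on context. This is purely illustrative, not the real compiler's code.

```python
def dynamic_cost(replacement_instrs, block_instrs, per_instr_cost=1):
    """Estimate the cost of emitting `replacement_instrs` into a block,
    counting only instructions the block doesn't already contain (i.e.
    assuming CSE will reuse the existing values for the rest)."""
    existing = set(block_instrs)
    return sum(per_instr_cost for i in replacement_instrs if i not in existing)

pattern = ["mul t0, a, b", "add t1, t0, c", "add t2, t1, d"]

cost_empty_block = dynamic_cost(pattern, [])                   # nothing reusable
cost_warm_block = dynamic_cost(pattern, ["mul t0, a, b",
                                         "add t1, t0, c"])     # two already exist
```

A static table would assign this pattern one fixed cost; here the same pattern costs 3 in an empty block but only 1 when two of its instructions are already available for reuse.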
So what I did, and there is a patch in the MR that shows this... okay, so the question was: is there a different way to architect the machine descriptions so that you basically have common stuff, and then in the machine descriptions for newer architectures you just override things? What I did for the common parts was: I would pull out a common part, make a separate Python list for the common stuff, and give that list a name. Then, in each generation that was going to reuse that list, I would just concatenate it in with the other list. So I would have a concatenation of lists to pull in the things that could be shared, and the stuff that couldn't be shared was just listed out in each machine description. That seemed like the cleanest way to do it. As just a list of Python objects, I don't think there was a good way to override things that were in the list, but I'm definitely not the Python ninja on the team, so there may be a way. Do I have time for another? One minute left. Anyone? Yes, in the back? Yes. So the question is: if you have multiple different non-SSA sources, and you have a large enough tree with multiple of those sources, might there not be a point in the program flow where both of those values actually exist? Maybe? That's true. And that's one of the things I want to sit down and spend more time on: trying to find some actual concrete counterexamples. The thing that bothered me about this is, you know, I implemented that bit and I had that nagging voice in the back of my head: are you sure that's safe? It seems like there are a bunch of holes here you might fall in. And I'm like, all right, well, I have this giant database of thousands of shaders and some of them are really huge.
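The list-concatenation scheme described in that earlier answer can be sketched in a few lines; the list names and pattern entries below are invented for illustration, not the actual Mesa files.

```python
# shared.py (hypothetical): patterns common to several generations.
COMMON_LOGIC_PATTERNS = [
    {"match": "(iand a b)", "emit": "and a, b", "cost": 1},
    {"match": "(ior a b)",  "emit": "or a, b",  "cost": 1},
]

# gen9.py (hypothetical): pull in the shared list by plain concatenation,
# then list out the generation-specific patterns that can't be shared.
GEN9_PATTERNS = COMMON_LOGIC_PATTERNS + [
    {"match": "(inot (iand a b))", "emit": "or -a, -b", "cost": 1},
]
```

Since these are plain lists, there is no built-in override mechanism, which matches the caveat above: a later entry can only be appended, not replace a shared one by key.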
Like, there are some compute shaders in some games that end up generating like 10,000 instructions, and they all just compiled and didn't explode. So I can't find a real workload that disproves it, so it's hard to, yeah. And I'm being told that my time is up, so I'm gonna stop so I don't get in trouble.