who made this possible, but for example Lucas is apparently still outside doing his organizing thing. A quick word about myself: I'm a PhD student by day, and by night I occasionally do some shader compiler hacking, and that's what I'm going to talk about. Okay, so here's the plan for my talk. First of all, very quickly, what's GLSL, which probably most people here know, but just very briefly. Then I'm going to talk about the Radeon hardware, R300 to R500, from a compiler writer's point of view. Then the main part will be an overview of what the compiler looks like right now and some thoughts on how we got there. There will be a part on what is missing for GLSL and how we can get there, and some final thoughts as well. Okay, so would you please all raise your hands for me for a moment. Now those of you who have worked with GLSL or the TGSI assembly, please put your hand down. Okay, so most people actually have, but there are some who haven't. And that's a nice trick to get everybody to raise their hand first. Okay, so here's a very rough overview of what the OpenGL pipeline looks like. You have vertex fetching from vertex arrays, then transformations are applied. In the latest versions you have a geometry shader which can again modify that output, but the hardware I'm going to talk about doesn't have it, so we can forget it. Then primitives are assembled and rasterized, the resulting pixels or fragments are again shaded, and it all eventually ends up in the frame buffer. The point is that these yellow boxes are programmable. As you see on the right-hand side, there is a C-like language that you can use to modify the functionality of these yellow blocks. And what we need to do as driver writers is to get from this textual representation of some C-like code to filling in a structure like here, which just contains the binary machine code that the hardware understands.
And this compilation step is what I'm going to talk about. Once a shader is compiled, then every time we do rendering using it, we just use these stored binary values and send them off to the hardware. For GLSL, the first step of compilation in Mesa is entirely independent of the hardware. It generates an intermediate assembly language, which is also used for fixed-function old-school OpenGL and for the ARB assembly extensions. And in the case of Gallium, if you have some other state tracker, like maybe the Xorg state tracker which is being hacked on, then that also ends up in this assembly language form. So what we need to do is take this assembly language and turn it into machine code. The thing is that we also need to do some optimization steps, because each piece of hardware is a little different, and the assembly that is generated by the compiler may not be optimal for what we need to do on the hardware. And here's just an example of what this assembly language looks like. It's fairly self-explanatory: you have instructions like move, subtract, multiply, and this is the assembly. Okay. By the way, if there are any questions, of course, feel free to interrupt me and ask. Okay, so about the hardware. I'm going to talk about R300 to R500, which are supported in Mesa by a single driver. Or actually, there is a single classic driver and there is a single Gallium driver, but okay. Here are the marketing terms that this roughly corresponds to, in case you haven't seen it yet. And let me just say that the newer chips, Radeon HD and onwards, are very different in terms of programmability, and I'm only going to mention them briefly at the very end. Okay, so we have a programmable vertex shader; that's the first yellow box you've seen. And the hardware there is very close to this assembly that we use intermediately, which is quite nice.
Another nice thing is that there aren't many differences across the hardware versions from R300 to R500. And the differences that are there are all new features, so they're backwards compatible, which makes life easy for us. Let me give you an idea of what a vertex shader (PVS) instruction looks like. First of all, you have a bunch of register files here, indicated on the right-hand side. Most of them are pretty standard, except for this strange alternate temporary register file. This is just a second register file in which you can store temporary values, and it has somewhat different restrictions than the other temporary register file. Instructions go basically in three steps. The first step is to select the up to three operands for the instruction that you want to use; most instructions have two, but multiply-add has three. We first select which register we want to use. We can then take absolute values. We can do swizzling, which means exchanging components or replacing a component by zero or one. And then we can negate components individually. This is very nice because it's very flexible; in fact, the SWZ instruction that we have in the assembly, you get it for free. Then the instruction is executed, and then you have some post-processing as well. R500 is a bit more powerful here, to support flow control, if and else in particular, in a nicer way than we can do on R300. And then things are stored, of course, in the registers. The machine code is just a bunch of bit fields, and basically each box that I've drawn here on the left-hand side corresponds to one bit field in this machine code, so this really corresponds to our view of what the hardware does. Okay, so from my point of view, what's good about it? I've already mentioned the very flexible swizzle support.
Most instructions that we want to implement are supported natively by this hardware, which is not the case for fragment programs, especially on older hardware, where you have to emulate instructions like sine, cosine, and so on. There are some not-so-nice things. The worst are the operand restrictions. If I go back one slide: you have up to three operands, but you can only use one input register at a time, and you can only use one constant register at a time. If you want to use more, then you have to insert some kind of spilling moves. You can also only use two temporaries at a time, which means that if you have a multiply-add, it would be nice if we could put two of the operands into the temporary file and the other one into the alternate temporary file, because then we could do it in one cycle instead of splitting it into two instructions that take two cycles. But this is an optimization that we don't do yet, basically for lack of manpower. Another nice feature this processor has is that, under certain limitations, you can combine a vector instruction with a scalar instruction. But the limitations are kind of nasty, and again, because of lack of manpower, nobody has had the time to really make use of that so far. Then the fragment processor; it's called US in the AMD documentation, for some reason I'm not entirely sure about. The weirdest thing about this piece is that the arithmetic unit is split into a three-component vector part for the RGB components and one scalar part for the alpha component. What's a bit tricky, though we got used to it, is that there are many changes going from R300 to R500, in terms of how texture instructions are scheduled, in terms of additional features, flow control. But the nice thing is that the ALU philosophy, let's say, of having this RGB and alpha split has stayed pretty much the same, which makes it easier for us to share a lot of code.
And again, I'll show you a similar picture in a moment. No wait, there's another thing first. Texture instruction scheduling is an interesting problem as well, because on R300 you don't have a sequence of texture and ALU instructions that are intermixed. Instead you have one set of registers into which you write texture instructions and one set of registers into which you write arithmetic instructions, and then there are additional bit fields that tell the hardware: okay, please execute the first four texture instructions, then please execute the first ten arithmetic instructions, and so on. As I've tried to visualize here, you have one block of texture instructions, one block of arithmetic instructions, and they alternate. The problematic thing is that you have a very limited number of blocks: on R300 there are only four blocks of texture instructions and four blocks of arithmetic instructions that you can use. So you have to be careful to group texture instructions so that they run at the same time; otherwise you might not be able to support even rather simple shaders. I think there was one bug once about a Compiz plugin that used five rectangle textures. The thing is that rectangle textures need coordinate scaling via arithmetic instructions, so you had texture, arithmetic, texture, arithmetic, and ran out of blocks. So one optimization we had to do was to move all these texture instructions together. The R500 is nicer in that respect: there you really have a normal sequence of instructions, and the instruction format is unified, five words per instruction, it's very nice. There's some potential for optimization in doing manual synchronization between texture and arithmetic, which should be rather simple, but nobody has bothered so far. Yeah? Is there still a performance improvement if you group the textures, even though it doesn't matter anymore on R500?
That's an interesting question, and I suspect that it does matter, because you have a synchronization flag in the arithmetic instructions which tells the processor: please wait until all the texture instructions are finished. If you do some clever grouping, and maybe move the arithmetic instructions that need the texture result as far down as possible, then you could get better throughput. So yeah, it's a good point; it probably still matters somehow, but we haven't done it. Okay, now here's what the instructions look like. The most important message is that you have this big vertical split between the three-component vector part over here and the scalar part over there. Even the register files, you can think of them as completely separate. Similar to before, except this time you only have a constant register file and a pixel stack register file, because the pixel stack contains the temporary variables and is also initialized with the inputs; that's a minor difference. You have slightly more flexibility in how you control your sources and operands. What you first do is select the source fields, where registers are loaded, and then you have the ability to do swizzling across all the units, which in theory allows for some nice hacks, because you could have an operand here that uses the R component of register 0 and the alpha component of register 10. In theory; I don't know if that's particularly useful. And you can do the usual modifications. Then you have the instructions, which are in principle separate, except that some stuff like dot products needs some crosslinking. Also, you have the ability to take the output of the scalar instruction and replicate it over to the RGB side. If you want to do that, it means you can't use the RGB instruction slot of that instruction.
And then the usual output modifications, and then you can write it to the frame buffer; not directly to the frame buffer, of course, but to the output which then goes into blending, or you write back to a temporary register. Okay. So, some challenges here. I've mentioned this briefly before: there are many instructions that need to be emulated, but this is relatively simple to do and works well. Then there is this split, which is a challenge in terms of instruction scheduling. I have some code that does it, and I think it actually does it fairly well, except for one problem. I've seen a lot of shaders that do something like compute the reciprocal of a scalar that is in the X component of a register and write the output again to the X component. The problem with that is that the RGB unit can't do reciprocals. So what we have to do is load the X component into the alpha unit and then replicate the result to the RGB, which wastes the RGB vector slot in that instruction. There's the question of whether we could move these components around in a clever way, but that's a more difficult subject, I guess, and we're not doing it; again, limited manpower. On the older chips you have to do some swizzle emulation, but that has been pretty stable for two years now. And of course there are some little bonus features that it would be nice to use optimally. What I didn't explain is this presubtract thing: it allows you to do something like subtract source 0 from source 1 before doing the actual instruction. This allows you to do something like linear interpolation in a single instruction instead of using a multiplication and then a multiply-add. It would be nice to have; there are some limitations, because you're not as flexible with swizzling when you want to use it, which is the main reason why we've been lazy so far in supporting that. Here's a picture about flow control.
You have the issue that when all pixels want to jump at a branch instruction, it's fine, because the hardware operates on many pixels at once. If none of them want to jump, it's also fine. If some want to jump and some don't, then you actually have to fiddle with deactivating some pixels temporarily and execute both branches until the if-else ends. But I'm not going to elaborate on that too much. The nice thing about flow control support in the R500 is, one, it's very flexible, and it's actually very easy to map GLSL onto the hardware. There are some other challenges which I'm going to mention later. There are lots of possibilities for optimization there, but we can think about that later. Okay, so far for the hardware details. Now I want to give you a high-level overview of how the compiler works right now and how we got there. In the beginning, we were young and needed a driver; we didn't know too much, we had no documentation, and so what we did was just loop over all the instructions and try to convert them into machine code as well as we could. Then, as we learned more about how the hardware really works, we wanted to use new features, and we wanted to fix bugs, which caused new complexity, because you have interactions between emulating instructions and doing the swizzle emulation on the older chips, for example. There was also the issue that initially we did the R300 and R500 fragment programs entirely separately, which was not a good way to live, so we wanted to share code there. And what ended up happening, from a very high-level point of view, is that often there was a decision to take a single pass in the compiler and split it into multiple simpler passes that communicate using some intermediate representation, which changed over time. And I guess the main philosophical change, which took me personally quite some time, is to really embrace multiple passes.
Also, since last year, when the Gallium driver started to pick up speed, there was a decision to share the compiler between the two drivers: to make it as independent as possible from everything else and just share it. So, I talked about multi-pass; there is kind of an explosion going on. This is what we had initially, this is roughly what we had at the end of 2008, and this is more or less what it looks like in master right now. You see that the single pass was split up. First we emulate instructions, just replacing them by native instructions in the assembly format. Then there was a strangely named "not quite static single assignment" and dead code elimination pass, which also took care of swizzle emulation. There's a tricky thing about swizzle emulation: the way Mesa generates assembly, if you use only two components in an instruction, you often get swizzles like X, Y, Y, Y. That's not a native swizzle on R300, but you don't actually need the third and fourth components; you can just ignore them. So what this pass did, for the first time, was to analyze which components of the input operands are actually used, mark the unused ones, and take that into account in the swizzle emulation. Then there was a separate scheduling pass, for the fragment program scheduling these pairs of RGB and alpha instructions as well as we could, and then the emit. At that stage, actually, only the final emit was different between R300 and R500; all the rest was pretty much shared, except for some instruction emulation details. And again it was split further to make things slightly easier, and there's even a new pass. Up here we use pretty much this assembly format as the intermediate representation; down here there is a new instruction format which is modeled after what the hardware actually does, and this split is really represented in the form of a C structure. Well, what are the trade-offs of single-pass versus multi-pass? Multi-pass can be slower
because there might be some information that you have to recompute several times. However, the advantages are really overwhelming, because it's easier to wrap your head around one pass that does only a single thing instead of trying to do many things at once, so it's hopefully a lot more understandable and maintainable. It's also easier to share code, because if you have a single pass that does one thing, then maybe it applies to some other hardware as well. And the compilation time doesn't matter that much, because we usually only compile shaders at the start of an application. Of course this slows down application startup, and we shouldn't completely ignore that, but it may be worth it: we don't have enough people working on this thing, and having it easily maintainable is just much more important. Here's an example of how we share passes. Right now we have fragment program compilation and vertex program compilation. Of course the final emit can't be shared, but dead code elimination is shared. This one is a pass that is only relevant for vertex programs, so it can't be shared. This is something that we should share, register allocation, but we don't do it right now, which is a bit sad if you look at it. And for instruction emulation, everything that can be shared is shared; neither set is a subset of the other, so it can't be shared entirely. So that's very nice from a maintenance point of view. Now, I think that to understand a program, the best way is to try to understand the data structures, and the most important data structure here is how we represent the programs in the intermediate steps. That's actually a very simple representation: it's just a doubly linked list of instruction structures, and the instruction format comes in two flavors; there's the assembly style and there's the one that goes into the fragment program hardware, as I've already said. We maintain a list of the constants used, because we need to add constants when we emulate sine and cosine, but that's it
about this intermediate representation. I personally really like the doubly linked list, because it's very easy to insert modifying instructions, which is something we do a lot, and it's also easily understandable. I really don't like TGSI for this kind of stuff. There is one downside, which is that to really do optimization, peephole optimization or whatever, we want to look at an instruction and say: okay, this instruction writes some value; now we want to know which other instructions use this written value. And that is a query that, with this representation, in the worst case has to look at the entire program, which is unfortunately slow. I did experiment a little with slightly more clever data structures here, which could yield an asymptotic speedup, which would be very nice. The problem is that making sure all the invariants you want are actually maintained is tricky and can easily lead to bugs, because in theory you can do all the abstractions you want in C, but somehow it's not very nice to express them. This is something where C++ tempts me, because there you can express some abstractions more easily. So when you go through it, it modifies in place, and that leads to trouble? It modifies in place, yes, that's the problem there. There are several different approaches you could try to fix that, but that's what happens right now. Can we talk about using SSA for this? I thought about it. The problem is that your registers are vectors, and there are many instructions that don't actually replace the whole vector; they kind of mix the original value with some new components that get overwritten, and I didn't really find a good way to deal with that. I don't know if there is literature on this kind of stuff, but I did look at LLVM, and it didn't seem like a fit; I mean, it was rather geared towards what you have in a usual CPU, so it seemed
rather problematic, although for some of the newer GPUs it might be worth looking at again. Doesn't the same representation issue exist on ordinary CPUs with vector units, where you also write sub-structures of registers? Well, I guess the SSA representation there sits at a much higher level than that part, so I'm not sure how consistent that is; but on the other hand you would get some nice things, like being able to have dominators. And where are the ones you're talking about? They're implicit. I don't know; maybe we can talk about this later, about how you do this kind of thing with SSA, that would be interesting. Okay, another little detail is that at some point we wanted to do dynamic allocations, so that we don't need to think about fixed-size arrays, and maintaining all that manually is a bit painful. So what we do is we have a memory pool structure which only has an allocation function; it allocates whatever you need, which makes things much easier. Yeah, this is already kind of my overview of the compiler, and now I want to talk a little bit about what we need to do for GLSL, what the remaining things are to get really good support. What is worth mentioning explicitly at this point is that most GLSL shaders today actually work just fine. There are some features which are missing, which is why it's a bit dodgy to claim to support GLSL: flow control support in vertex programs isn't there yet, supporting loops isn't there yet, and there are some additional instructions that I think we would still need to emulate, but that's a small thing; the other two things are a bit bigger. And of course it would be nice to have a lot of optimizations. Okay, how could we go about implementing loop support? Mapping the instructions onto the hardware machine code is actually pretty simple. The real problem is that, again, it comes down to this data-flow stuff, because if you
right now compile a program which has loops, then passes like dead code elimination just don't understand that if you write a register at the end of a loop and then read from it again at the top of the loop, there is a dependency which goes backwards. There is code to support branches, so that works fine, I think, but loops aren't supported yet, and this is the harder part, because some of that code is rather subtle and you have to be careful about what you modify where. But I hope to get around to that soon. Optimizations are also an interesting problem. Here is the GLSL program that I showed at the very beginning, and here is the assembly that the Mesa GLSL compiler produces, which has 32 instructions. If you do some very clever transformations, you can get it down to 8. A more realistic goal, which should still be manageable, I think, would be to get at least down to 16 or so. Here is an interesting, kind of philosophical problem about how you structure things at a high level. It would be nice if the hardware-independent GLSL compiler already did some optimizations; there are some optimizations that it could just do like that. The problem is that this compiler doesn't know about the final hardware, especially in Gallium, which is a bit awkward. The thing is that, for example, as far as I understand the Intel hardware, they would probably actually be quite happy about this kind of stuff, but we are less happy, because scalars get placed pretty much randomly, ignoring this RGB versus alpha split. So what do you do? Do you go into the GLSL compiler and do some optimizations that would be nice for us, but then maybe annoy some other hardware? I don't know what the Nouveau hardware looks like, or Intel. So right now we try to do everything in the driver, which doesn't have the original GLSL; it just has the assembly and tries to understand it as well as it can. One question: even if you
can't do the optimization at this point, why not provide more information to the hardware back end, for example actually tagging things in the initial format instead of having to analyze it a second time? I guess it would be nice, but I mean, we already do this unused-component analysis, which is not very complicated. As you said, you have this extra step of figuring out the data dependencies and stuff like that; you could already provide that, since every driver that does optimization will need that information. That's true, yeah; I guess there would be value in pushing this unused-component marking into what Mesa does, and also into what TGSI does. Is anyone working on that, or something like that? I don't think so. Does anybody really have this high-level view? What I know is, I'm working on the LLVMpipe project, putting everything on LLVM, but still, even for the translation to that, there's a lot of extra stuff which could already be done in a shared way. I really think TGSI itself is pretty good for these intermediate passes, but if we could have a modifiable, either doubly linked or SSA, representation of TGSI, on which you could run some of these early passes, and which every driver could get instead of plain TGSI, I think that would be pretty useful for everybody. That would be very nice, but we're not doing that yet; yeah, I agree, lifting some of this optimization stuff into Mesa and then maybe augmenting the TGSI representation would be very useful. Here are just some examples of the kind of thing you can do. Something that is a bit magic, maybe, but actually works on R500: here you have something that first multiplies two scalars and then subtracts; if you go back to the GLSL, that's the cross function here. Well, it's a modified dot product, really, and maybe we could recognize that and do some magic, which actually works in hardware on R500, to save some instructions. This, by the way,
is an example of why doing some of these optimizations as hardware-independent code is maybe not a good idea, because if I get code like this for the R300 fragment program, then I'll be annoyed: then I have to worry about all the swizzling here, which is not supported on R300, while R500 can do it. Couldn't it be a shared but optional module, like the draw module, so any driver could opt in to using it without it being mandatory? That's true. The question is, do you want to do this before the Gallium state tracker produces TGSI? You could still share it then, that's true. I mean, there are some optimizations that might be easier to do before the state tracker gets its hands on it, because you might still have more information about where the code comes from, from the GLSL. I don't know how feasible that is; I'm a bit afraid of the GLSL compiler. Then there is some stuff that you can do with constant folding. This is the greater-equal comparison from the GLSL, which has a zero constant there, and we can do that more efficiently in the R500 fragment program by using some of the flow control features that it has. How do we implement such data-flow optimizations? The approach that wouldn't change this intermediate representation would be to have helper functions that help you figure out where values are used and where they come from, and then add the optimization as just an additional compiler pass that does this one thing. The nice thing is that if you get some miscompilation, you can just disable that compiler pass and see if it helps, which is useful for debugging. And then hopefully, with the helper functions in place, doing the actual optimization becomes straightforward. If we had an SSA-based representation, then I guess this would look different, but this is something that would work in the current intermediate representation model that we use. Okay, and with that I come to the last part, about code sharing, R600 and some other stuff. As far as code sharing is concerned, well, we've already seen a lot of examples that are rather hardware
specific, that you just cannot share. I think there are still many things that could be shared, and it would be nice if we could share them. The real problem, which I think already appeared in the discussion, is that to be able to share code we need to share data structures, and we have to somehow agree on something that works well there; that's maybe for some future discussions. R600 is interesting because it has the same processor for vertex, fragment and geometry shaders. There already is an assembler that I think works fairly well; it doesn't do any optimizations, however. The processor is quite interesting because it has four separate ALUs for the vector instructions, but you can actually do different instructions on each component, and then there is an additional fifth unit that also supports reciprocal, sine, cosine, these more esoteric instructions. I think GLSL shaders that use a lot of scalars map very well onto this model, but there are some problematic operand selection restrictions. If you really want to use the hardware to its full potential, you again have the question of how exactly you do the instruction scheduling, whether you maybe move some components around. Of course we can't reuse anything that we did for the R300 there, because the split is just too different, but again, optimization passes would be nice to share. Okay, now there is one slide on how to get involved in shader compilation stuff; it is a bit scary, I have to admit. Here's what you need before you start: you do need some understanding of GLSL and of these assembly instructions, otherwise there is no way to really wrap your head around this. The best way to get that, I think, is to just hack on some toy applications, or maybe there's some new Compiz plugin or whatever that you want to work on; that would be a nice way to learn it. You definitely don't need to be a 3D expert; you just need to understand how GLSL works, and the assembly of course. As for all open source projects, I mean, pick
something small as a first project. Something nice, maybe: take some shader that is really used, from some open source game or from Compiz, and just look at what the compilation result looks like right now. There are some debug flags that you can toggle to enable this output. Then you could look at the assembly that it generates, and maybe you notice something that doesn't look good, that could be easily optimized, and then try to optimize that. And of course, I think it's a learning-by-doing thing, because there's really no book on the subject. I mean, there are some books on general compiler design, of course, but I don't think there's anything about shader compilers specifically. Okay, and one more thing, about maybe thinking about how we improve the way we work, because if you have better tools for your development, then of course you don't have to worry about the small stuff as much. One important thing is to keep the source code maintainable. Everybody knows this, I'm probably preaching to the choir here, but I think the things that we did in the compiler, going to multi-pass and so on, helped a lot in that respect. Then there is the question of maybe programming at a higher level. I know C++ is a touchy subject, but sometimes I feel like it would be nice to have. I've heard that some compilers use pattern-based optimization stuff, where if you have a multiply followed by an add, you just combine it into one instruction, and lots of patterns like that. Maybe instead of writing specific C code for each of these replacements, we could find something higher level that describes these patterns and transformations in some very high-level language, and then do some code generation that produces the code to do that; there's some theory on that. Then, when you modify something in the compiler, it's very easy to break stuff, especially some subtle swizzling combinations and so on, so it's good to have automatic testing. So test, test,
test. A nice heuristic is that if there are no Piglit regressions after you change something, then you're probably fine. It's no guarantee, of course, but I think the tests right now cover a lot of the typical bugs that are introduced again and again when you work on the compiler. A kind of crazy idea here, to make the thing even more robust: maybe we could generate shaders randomly, then just render using them and compare the result to software rendering. Maybe that would be an approach that helps us find more compilation bugs. I haven't tried it, so maybe it's something to hack on. And I mean, if you have some ideas, of course it's always nice to share these insights. And yeah, I think that was faster than I thought I would be, and I'm done, so thank you for your attention. This means that there is now time for questions, if there are some left.