Yeah, so I'm here to talk today about how we've spent the last year or so building a brand new compiler inside of our OpenGL libraries. This comes as a bit of a surprise to a lot of people; they didn't know that OpenGL actually involved compiling, but it's most of what we do these days. So GLSL is this C-like language. For 3D modeling, you have a collection of vertex data and a bunch of operations you want to do on it: translate it into place, rotate the arm of the model, all of that sort of stuff. So you want to do a bunch of operations on vectors, and you want those operations to be programmable depending on what your application actually wants to do. You don't want some fixed-function thing that says you can only translate and scale and rotate. So how this works is that the application hands a source string to the OpenGL library that consists of a small program to do these operations. It gets some data in at the top that you declare, and it outputs vertex position, texture coordinates, all of those things.

So that's the vertex transformation side. There's also fragment shading, where once the graphics hardware has figured out where your triangle is on the screen, it calculates for each of those little pixels what all of your parameters were by interpolating them between the vertices. So you get this little program instance that gets a bunch of those parameters coming in at the top, and you have to do a bunch of math and decide what your color is coming out the other end.

So what does it look like? Here are some very, very simple programs, the sorts of things that our compiler used to be good at. At the top left we have a 4 by 4 model-view-projection matrix that you multiply by the vertex position, and you get the position in eye space, for example. At the bottom is a fragment shader for that, where we just assign a constant color. On the right-hand side we extend that a little bit and add some texture coordinates: the vertex shader on the upper right still does the model-view-projection transform and passes through the texture coordinates as a two-component vector, and on the bottom right we show a fragment shader that does texturing, where we take those texture coordinates, look them up in a rectangle texture, and set the result as our color. These are all very simple; these are the sorts of things that we handled well before.

But the problem is that these days programs look more like this. We need something a little bit better to handle the giant reams of code that people throw at us. Before, we had a compiler that looked at your code, made a syntax tree, and dumped it right into the existing intermediate representation that the drivers consumed. But it turns out that now we need actual optimization, and that's hard to do in that form.

The good news is that a lot of compiler stuff is easy for us. Compiler techniques are extremely well known; when we started this project, the first thing we did was go and buy some textbooks. Also, lex and yacc handle a lot of the really irritating parts, and we're very glad to have those. Another thing working in our favor is that these programs, even that nasty looking program there, are really short. So we can throw almost arbitrary amounts of CPU effort at optimizing these programs and nobody's going to notice. The language has no idea of memory: there are no pointers pointing at anything, so we don't have to worry about aliasing. There are just these vector data types.
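For readers following along without the slides, the four simple programs just described look roughly like this. This is an illustrative sketch in GLSL 1.20-era style; the names mvp, texcoord, and tex are assumptions, not necessarily what the slides used.

    // Upper left: vertex shader that multiplies the vertex position
    // by a 4x4 model-view-projection matrix.
    uniform mat4 mvp;
    void main()
    {
        gl_Position = mvp * gl_Vertex;
    }

    // Lower left: matching fragment shader that just assigns a
    // constant color.
    void main()
    {
        gl_FragColor = vec4(1.0, 0.0, 0.0, 1.0);
    }

    // Upper right: same transform, plus passing texture coordinates
    // through as a two-component varying.
    uniform mat4 mvp;
    varying vec2 texcoord;
    void main()
    {
        gl_Position = mvp * gl_Vertex;
        texcoord = gl_MultiTexCoord0.xy;
    }

    // Lower right: fragment shader that looks those coordinates up in
    // a rectangle texture and uses the result as the color.
    #extension GL_ARB_texture_rectangle : require
    uniform sampler2DRect tex;
    varying vec2 texcoord;
    void main()
    {
        gl_FragColor = texture2DRect(tex, texcoord);
    }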
There are some downsides, though. GPUs don't really look like CPUs. We pretty much always work on vectors: there are floats in the language, but generally on most GPUs those floats live in a vec4 register. We have write masks on these registers, so you can update just one component of a register, which makes a lot of optimization hard. Also, GPUs don't have general flow control.

Write masks are one of the big things that make GPU optimization different from the way you do compilers for CPUs. In all of these optimization passes you want to know: where did the value that I'm looking at come from? So you want to look up in the program and see where it was defined. That's easy for scalars; you look at the last thing that wrote to it. For a vector, though, where you can update one component, such as this shader right here, what does that mean? I have a color that came from a texture, and then I updated the red, green, and blue with a mix towards some gray color because I wanted to desaturate (this shader actually came from GNOME Shell), and then I set my result to that. So if I want to do copy propagation into my result, where did the color come from? Was it the first instruction, the second instruction, or both? It turns out there are actually two different answers, depending on how your GPU is built.

The way that many GPUs work is called array-of-structures, AOS, mode. This is where those vec4s are packed into your registers approximately like you would think: you have XYZW and you put it in the register. And that has the data dependency issue I just mentioned, where a portion of the register got updated in one instruction and the other portion came from the definition before that. There's also a structure-of-arrays, SOA, mode, where instead of packing the whole vec4 into one register, you have four separate registers, one for each component, and each register contains multiple program flows in its channels. This is really nice for those channel updates from just before: now each channel update is one instruction and all of your data dependencies are easy. You just look at what updated my X channel, look at what updated my Y channel. And there's quite a mix across the graphics hardware we support: the 965 changed to doing SOA instead of AOS, and NVIDIA, as of the generation before the current one, changed to SOA for both vertex and fragment processing. So our compiler needs to be able to do everything both ways.

Other joys of working with GPUs: they don't have flow control. You can't jump. Or rather, you can jump, but it's going to break your program, because when you have everything packed so that eight different program flows are running in one register, what happens when half of them decide to jump and half of them decide not to? It's not going to work. So as of about six years ago, there was no flow control at all: no jumps, no ifs, no loops, nothing. You got straight-through execution from top to bottom of your program. Unfortunately, GLSL requires ifs and loops, because that's how people actually want to program. So GPUs now can do that. They can do if statements where half of your fragments take one path, and you only update the register components for that half, and then you hit the else and switch to updating the other half of your register for the other execution path. You still don't get jumps. On older hardware such as the 915 that we were trying to support, this meant we didn't have flow control at all.
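As a concrete picture of that write-mask problem, here is roughly what the desaturation case looks like at the source level. This is an illustrative sketch, not the actual GNOME Shell shader, and the uniform names are made up.

    uniform sampler2D tex;
    uniform float desaturation;
    varying vec2 texcoord;

    void main()
    {
        vec4 color = texture2D(tex, texcoord);        // defines all of color.xyzw
        float gray = dot(color.rgb, vec3(0.299, 0.587, 0.114));
        color.rgb = mix(color.rgb, vec3(gray), desaturation);  // write mask: only .rgb updated
        gl_FragColor = color;                          // .rgb from the mix, .a from the texture
    }

    // In AOS mode, color is a single vec4 register, so the value copied
    // into gl_FragColor has two defining instructions: the mix for .rgb
    // and the texture load for .a.  In SOA mode each channel lives in
    // its own register, so every channel has exactly one definition and
    // the copy propagation question stays simple.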
So the way we ended up handling the lack of flow control is that we take the whole if block and turn its body into conditional moves predicated on the if condition. So every instruction gets executed for all your fragments.

Array access is another thing GLSL added that GPUs are pretty bad at. I said that we don't have memory and pointers, but you can declare a temporary that has a size and do array access on it. A lot of GPUs simply don't do this; the 915 doesn't, and the R200 doesn't either, I don't think. And yet we have to support it. So we used a little trick, again using conditional moves, where you look at the array index that was chosen and, for each element of the array, if the index is equal to that element's position, do the store.

So on this hardware, where we're already doing horrible things like exploding all of our if statements into a bunch of conditional moves and splitting our array accesses into a bunch of moves per element, we also have limitations on instruction count. Back on the 915, you only had 64 math instructions you could execute in your program, plus 32 texture instructions, and they had to be ordered in a very specific way. The R200 was a little more lenient: 128 instructions. The next generation, 256 instructions. They're really limiting. If you're bad at code generation, you're not going to fit on the hardware; you're going to fall back to software at best, or simply tell the user "no, your program is too big, I'm not going to do that for you." Which GLSL actually allows you to do: you can say "failed linking, unacceptable program, try again with something simpler."

And on these instruction-limited pieces of hardware, registers are also quite limited. On the 915 we had 16 temporary vec4 registers. That's all the data you could work on at all. There's no memory access, like I mentioned, so I can't spill registers; if I don't fit in 16 vec4s, I have to tell the user I can't execute their program. On Radeon, things were getting a little better. And even on the 965, where we have 128 registers, each of which is 8 floats, we still have a lot of issues with register allocation. Because almost all programs fit into 128 registers, the hardware designers didn't spend a whole lot of time on making the memory access path really fast. So if you ever do spill registers, you pay for it. For example, we hit a bug in Lightsmark where an optimization led to register spilling, and one shader out of the ten or so in use reduced overall performance 50% by spilling. So register allocation is quite important to us.

After all of this complaining about how GPUs are awful, there are a few things that are easier. OpenGL has very lax suggestions about how math should work; the actual text is approximately "one part in 10^5, please." So we can do almost arbitrary things to your math, which is quite nice, because often people will ask for one over one over x, and we just say, well, take the x variable. Or things where you multiply by a constant, then do a bunch of multiplies by variables, then multiply by another constant: I can look at that and say, well, I'll just take your constant over here, move it over there in my expression tree, and constant fold them. And hard math things like the sine function we just approximate with a small polynomial, so it takes four instructions instead of 60.

So we went and did all this. Mesa 7.9, released in the fall, gave us GLSL support on the 915, which has all these limitations.
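Here is a rough source-level sketch of what those two lowerings amount to. The real passes work on the IR rather than on GLSL text, and the names shadowed, idx, and base are made up for illustration.

    #version 120
    uniform bool shadowed;
    uniform int idx;
    uniform vec4 base;

    void main()
    {
        // Original form (what the programmer wrote):
        //     vec4 color = base;
        //     if (shadowed)
        //         color *= 0.5;
        //     float arr[4] = float[4](0.0, 0.0, 0.0, 0.0);
        //     arr[idx] = color.r;

        // Roughly what the lowering produces for hardware with no
        // branching and no indirect addressing:
        vec4 color = base;
        vec4 darkened = color * 0.5;
        color = shadowed ? darkened : color;        // the if becomes a conditional move

        float arr[4] = float[4](0.0, 0.0, 0.0, 0.0);
        arr[0] = (idx == 0) ? color.r : arr[0];     // one conditional move per element
        arr[1] = (idx == 1) ? color.r : arr[1];
        arr[2] = (idx == 2) ? color.r : arr[2];
        arr[3] = (idx == 3) ? color.r : arr[3];

        gl_FragColor = vec4(arr[0], arr[1], arr[2], arr[3]);
    }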
You know, the 915 can't do flow control, so we had to lower all of our loops and all of our if statements; it has limited register space; it has no ability to do array access. And it actually does these things now, so a whole class of applications can run on this hardware that couldn't before.

In Mesa 7.10 we added native code generation for the 965. With this we got major performance improvements. No longer were we turning our GLSL into this Mesa IR that was very hard to optimize and then trying to turn that IR into our actual native code. We now work directly from the GLSL IR, call into our GLSL compiler to do a bunch of optimization specific to our hardware, and generate code out of that. Nexuiz, an open source first-person shooter that I've never actually played (I just run the demo all the time), got 20% faster overall from all of this work. So I think this is showing that we really can speed up OpenGL with the new compiler work. Looking at a lot of programs with the new compiler output, we're getting pretty good-looking assembly out the other end, which used to never be the case. I'm hitting the point where I look at programs and can't see an obvious way to improve them.

We still have work to do on some programs. I was looking at some applications the other day and realized that I had never finished writing an optimization pass for the mid-level IR that I thought I had done. So there's a lot of optimization work still to do, and a lot of room for people to jump in and work on it. It turns out that this stuff is not really hard. You can look at the output for your GPU, see the IR that was used to produce it, and say, well, there's this extra move here, why is that? Can I come up with something that would fix that, and go write a pass to do it? Generally these optimization passes are about 200 to 500 lines of code, so they're weekend projects, more or less.

We still need native code generation for other GPUs. On the 965 we've only done native code generation for half of it. The 915 has no native code generation yet, the R200 doesn't have it, and all of the gallium stuff still goes through the Mesa IR before hitting TGSI, before hitting your driver's IR, before hitting your actual hardware. So there's a lot of room for cutting a bunch of stuff out of our stack if we do a bit of work on our other drivers. And we still need native code generation for the CPU. Right now, if your hardware doesn't do vertex or fragment processing, we compile into the Mesa IR and then execute that Mesa IR with a CPU-side interpreter. It's rather spectacularly slow, generally on the order of seconds per frame, or minutes per frame. So we're looking at LLVM for that, hopefully. There's some code to work from on the gallium side, but we think it ought to be done a little differently, and the LLVM guys seem to agree with us. They said, basically, don't emit SSE code at us, just emit vector code at us, and if we don't generate good SSE code from that, come talk to us please.

I think that's about it. Any questions? We have a lot of time for questions, so hopefully there's a lot of them.

I'm kind of new to shader languages. How much of this applies to GLES 2?

So the OpenGL Shading Language has many, many revisions. Right now Mesa supports 1.00, which is the GLES version, plus 1.10 and 1.20. We're in the midst of working on 1.30, which adds a bunch more features. So right now you can use the GLES 2.0 drivers that come out of Mesa, which implement the GLES Shading Language. Or you can use desktop GL with the desktop GL Shading Language. Or, with an extension we've just added recently, you can use the ES Shading Language with your desktop GL, to offer a way for people to port applications. Notably, this is really important for WebGL: WebGL uses the GLES Shading Language, and yet they want to target people's desktop GL drivers. With this extension they don't have to do a bunch of processing on your shader program to try to turn the string they were handed into a string that fits into your language.
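For reference, here is a minimal fragment shader in the GLSL ES 1.00 style that WebGL and GLES 2.0 applications hand over; the required default precision statement is one of the visible differences from desktop GLSL 1.20. This is my own illustrative example, not one from the talk.

    // GLSL ES 1.00 fragment shader: a default float precision must be
    // declared before any floats are used in the fragment stage.
    precision mediump float;
    uniform sampler2D tex;
    varying vec2 texcoord;

    void main()
    {
        gl_FragColor = texture2D(tex, texcoord);
    }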
You mentioned a bunch of things that GLSL implements that are difficult to optimize. Has anybody been working on other shading languages for writing shaders that don't have all these requirements?

DX is the obvious one to compare to. Really, it's that the shading languages of the time are defined according to what the hardware of the time can do; they're generally not pushing the hardware. It's: oh look, the hardware can do this now, we can go implement that. But from an open source perspective, we would really like people to be able to use the latest shading language even on their older hardware, for example WebGL on your 915. The 915 can't do general GLSL, but it can do just about everything, so that's why we went to so much work trying to lower your programs into something that can actually execute on the hardware.

Have you encountered any developer resistance to what you're doing with the intermediate IRs?

Yeah, there has been a lot. So the problem is that we've built this new IR. This entire compiler was built in C++, which was a bit contentious within the team, and outside of the team too. The IR, and the way you interact with it, is C++, and it's something very new and different from what we had before. The previous IR looked like assembly; it actually mapped very well to ARB fragment program, which was built as "this is what one generation of hardware can do," and you just write the assembly for that hardware, and everybody else gets to try to implement somebody else's hardware assembly on their own hardware. So the current IR that most drivers take in is that old ARB fragment program style assembly. Now, with our expression trees and conditional assignments and all of this stuff, things look quite different, and so that new IR hasn't been plumbed down into gallium drivers, for example. Gallium takes our IR, turns it into Mesa IR, and turns Mesa IR into TGSI so that TGSI can actually talk to the drivers. I think we should start to see more traction on getting this plumbed more directly into gallium, hopefully. If I were interested in gallium, I would want to see this IR plumbed down into the driver, because writing driver code generation from the new IR is so much better than from the old stuff. Notably, it has things like types, where the previous IR only had floating-point vectors that you tried to store your integer values in. So there's a lot that's better now, and we would need to get that plumbed in for that to be interesting.

You said that the shading language is somewhat an artifact of the GPUs of the time. Are you trying to make this implementation likely to scale as new features hit hardware?

So right now we're still playing catch-up. We're on GLSL 1.20. GLSL 1.30 was released years ago, and the current version is 4.10, so there have been four or more versions that we need to catch up with. We can look ahead and see where things are going, and we're trying to architect it so that things won't be too hard.
But yeah, we don't really know that much of what's in the future. We haven't implemented what we need to do today yet. More questions?

There's an open source graphics engine, or game engine, that I sort of follow and sort of maintain called XreaL. It's based on Quake 3, and it sort of uses an OpenGL 4 style API on top of that. I'm not the main developer working on the renderer, he's a guy from Germany, but he found that he got a massive speedup when he wrote some extra code to parse the GLSL shaders and break them up into multiple shaders. So where you have if statements in the code, instead of leaving it to the driver to optimize that, he'd break it up into two different shaders, one to handle either side of the path. And that got a massive speedup, on ATI in particular. Is that sort of a technique you've looked at?

Yeah, that has actually been the case before. My demo that I've used for a bunch of benchmarking, I actually had a hack in there to remove my if statement that was designed to skip a whole bunch of code if you're entirely in shadow and not lit at all. Massive performance improvements, originally, from disabling my if statement. Now, with the new compiler, disabling the if statement doesn't help anymore, because we actually have good code generation at the back end that doesn't throw its hands in the air when it sees an if block. That's what we did before. It's like: oh no, a new basic block, I can't handle those in my optimization, oh well. So things are much better these days. Hopefully that has changed.

It would be nice to get some of those shaders, though. We have a big problem in testing the compiler: we don't actually want to run every game in the world for every change, and when you do run every game in the world, you have to watch the entire screen to see if some pixels are flickering over here. It would be nice if we could get some of those shaders out of XreaL into our test suite, because then, in addition to regression testing, we can actually run the shaders, look at the output, try to improve their performance, and see if they still render the same thing. So I would love to talk about how to get that into the piglit test suite, if possible.

Sorry, back to intermediate representations. If you're only looking for a half-day project rather than a weekend project, for all of those gallium drivers, would there be any point in trying to cut out just one step, say cut out Mesa IR and keep TGSI? Or would that just be a waste of effort later on?

I'm not sure. Part of my avoidance of the gallium stuff is that if I wanted to change that intermediate representation that all the drivers accept, I don't know how that impacts all of the closed source products that are built on top of gallium, and I just don't want to deal with that because I don't have any drivers in gallium. But yes, I think there could be big wins for the gallium drivers from switching from GLSL IR to Mesa IR to TGSI over to going from GLSL IR directly to TGSI. Notably, as far as I know, TGSI has sizes on its variables, which means that array access doesn't necessarily break all of your optimization on your registers. These intermediate representations, the Mesa IR and TGSI, generally operate on a sort of virtual machine's register space, so they don't reflect your actual hardware; they're something else that you interpret. By having sizes on those temporary variables, you know when you have an array access that you're only accessing those elements, and you can still do copy propagation and dead code elimination on the other instructions that don't touch those variables. So yes, I think that could be a big win for code generation on gallium.
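Going back to the XreaL question for a moment, here is a sketch of the shader-splitting technique it describes, using a made-up lighting example: instead of one shader with a uniform-controlled if, the application compiles two specialized shaders and picks one per draw call.

    // Single shader: the branch is resolved at run time on the GPU.
    uniform bool in_shadow;
    uniform sampler2D tex;
    varying vec2 texcoord;
    varying vec3 lighting;
    void main()
    {
        vec4 color = texture2D(tex, texcoord);
        if (!in_shadow)
            color.rgb *= lighting;   // skipped entirely when in shadow
        gl_FragColor = color;
    }

    // Split variant A, used for shadowed geometry: no lighting math at all.
    uniform sampler2D tex;
    varying vec2 texcoord;
    void main()
    {
        gl_FragColor = texture2D(tex, texcoord);
    }

    // Split variant B, used for lit geometry: the lighting is
    // unconditional, so the compiler never sees an if at all.
    uniform sampler2D tex;
    varying vec2 texcoord;
    varying vec3 lighting;
    void main()
    {
        vec4 color = texture2D(tex, texcoord);
        color.rgb *= lighting;
        gl_FragColor = color;
    }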
Something I see in my code is bad performance when I'm trying to draw into different FBOs and I've got to switch model-view matrices; under the hood, that's binding 4 by 4 matrices into a uniform that becomes your model-view and projection matrices on the shader side. Are there optimizations I should be excited about that are going to make those uniform updates better?

So that actually is, to a significant extent, a hardware problem. For example, my hardware here can only have four outstanding constant buffer changes at a time in its pipeline. So if you're constantly changing your projection matrix and doing a bit of drawing, it will end up bottlenecked on that, with a bunch of its units sitting idle, because something is blocked waiting for somebody to finish with an old set of constants before the new set of constants and its rendering can come in. A common way to deal with this problem is to just do the multiply for the model-view-projection on your own on the CPU side. It might be cheaper than handing it off to the GPU and having the GPU try to do those few multiplies, particularly if you're rendering just some small quads or something per state change. There's not a whole lot we can do on the compiler side to improve that, because it really is a hardware limitation as far as I can tell.

Any more questions? We have time for another five or so. The next talk is not till 11:30. One more in the back.

So how much of Mesa do you think now has been directly contributed by Intel devs?

It's hard to say. The compiler is about 50,000 lines of code at this point, and then our driver is about another 50,000 for the 965. The compiler we wrote mostly ourselves, and for the 965 driver at this point, I think we've changed most of the lines in it from when it was originally written. The ARB fragment program parser was written by us. It's hard to say; I haven't done the statistics on this, but I think it would be kind of fun to. But I am pretty excited to see all of the development going on in Mesa these days. We have teams working on Radeon and NVIDIA that are actually doing the driver-side code generation infrastructure that we've needed for so long. We can't just take the incoming IR and generate the obvious code out of that; in your GPU backend, you really do need to do a bunch of optimization that won't be visible at the mid-level IR. So people are actually tackling that these days. And what I'm hoping is that we can start to pull some of those people to work on the mid-level IR, for things that they were doing inside of their driver that can be done at the mid-level now. People were doing their own custom loop unrolling. People were doing their own custom array access lowering. All this stuff that really we can do at the mid-level now.
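Going back to the earlier question about model-view matrices, here is a sketch of the suggestion to do the matrix multiply once on the CPU, so the shader consumes a single premultiplied uniform instead of rebuilding the product per vertex. The names are illustrative.

    // Instead of uploading separate matrices and multiplying per vertex:
    //     uniform mat4 projection;
    //     uniform mat4 modelview;
    //     ...
    //     gl_Position = projection * modelview * gl_Vertex;
    //
    // upload the product, computed once per draw call on the CPU:
    uniform mat4 mvp;
    void main()
    {
        gl_Position = mvp * gl_Vertex;
    }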
Intel's got some new graphics hardware coming out with the new Sandy Bridge platform. Is this going to support that as well? Are you able to keep up with the new graphics hardware as it's coming out?

So we actually were building the Sandy Bridge support in Mesa as we were building this compiler. The two projects were working in parallel, all in the open. So people were testing Sandy Bridge before the hardware was released, on pre-release SDVs, right out of the Mesa tree. So I think things are going pretty well there in terms of new hardware support. For the code generation, there are some things that we're not quite taking advantage of yet. There are some new instructions that collapse other instructions, and there's some new errata that we need to work around, where we have somewhat inefficient workarounds right now. There's always errata in GPU hardware; there's always a huge list of things you need to do to make it actually render correctly. We just need to polish some of the new stuff.

Do you have any issues where fixing bugs that change the visual rendering gets you complaints, because now you're rendering correctly?

I don't think I've seen that one yet. Mostly it's: implement some new optimization that speeds up a bunch of programs by 5% or so, and, oh, now, the latest example, Civilization 4 started rendering wrong again. It's like, oh no. We really need a lot more work on our shader test cases. We're up to something like a thousand rendering tests in our continuous regression test suite that all of us developers use on every change, but that's still not enough. It's hard to come up with the big programs that are going to trigger bugs in optimization yet can be rendered as a small demo program that just prints out a solid color you can test easily. Coming up with those is really painful. That's something we definitely need to work on.

It sounds like a relatively simple language. You could kind of do random code generation to throw at it to test what happens as well. We do that with SQL, and that's a brilliant way to crash things horribly.

Yeah, I'm so afraid of somebody doing WebGL fuzz testing, because with WebGL now, people are generating source strings to hand into my compiler. We try to be careful and all, but we haven't tested this, and you're going to take arbitrary code off of the Web and compile it on my machine, and this is supposed to work. Apparently it works shockingly well so far in the WebGL testing that they have today. But yes, we need some fuzzing, because we're going to find so much, I'm sure.

Questions? You said you still need to do native code generation for the CPU. The Larrabee project, I guess that kind of went away as not quite ready, but has any work been done there? Because they're pretty much general purpose CPUs. Which project? The Larrabee project, which was Intel's many-core x86 that they were going to aim at graphics rendering.

Yeah, I'm not sure there would be much out of that to take advantage of, because their IR in their OpenGL library wouldn't match ours and their target instruction set wouldn't match ours. That said, somebody just joined our team from the Larrabee project, so it's fun to talk to them about some of the similar problems they faced that we face, in terms of things like: when you go to access a texture, you're not going to get results for a very, very long time, so you need to work on scheduling all of your instructions into that enormous delay slot you've got. We've done some of that work recently. Any last questions? I'll assume we're all out of questions then.