All right. We're already way over time, but it's always better if there is interest in questions, and I'm happy to answer them. So this is a much more technical talk about how we are compiling shaders in RadeonSI, which is effectively our OpenGL driver in Mesa. I want to touch on two topics. The first one is NIR, and the other one is something called dynamic linking, which I'll explain. So NIR: those of you who were here in the morning and saw Alejandro's talk about SPIR-V will have heard the term. Here's an overview (wow, the colors are really off here for some reason, but okay), here's an overview of how shaders are compiled in our various open source drivers. I think it's worth trying to follow that diagram properly. This shows today's defaults. On the leftmost path, you start with GLSL. It's translated into Mesa's GLSL IR, an intermediate representation that is shared between all Mesa drivers. Then it gets translated into something called TGSI, which, for those who know about Gallium, is a Gallium-specific representation. Then, within the RadeonSI driver, this TGSI gets translated into LLVM intermediate representation. Some of that code is shared with llvmpipe, but in the end the IR looks quite different. Then this is passed to LLVM, which does the backend compile and generates the actual binary that is then uploaded onto the GPU. Moving a little bit further to the right: if you're running Vulkan with RADV, you start with SPIR-V and take the middle arrow to NIR. This is the translation that was first introduced with Intel's Vulkan driver and which is now also used by RADV. Then the NIR gets translated into LLVM IR that looks like the LLVM IR that RadeonSI produces, and that gets translated into a binary. On the rightmost side, if you're running Vulkan with AMD's official driver, the SPIR-V gets translated directly into something that I like to call LLPC IR.
LLPC stands for the LLVM Pipeline Compiler that the driver has, and it's really just LLVM IR. But certain graphics-specific functions, like buffer loads, image sampling, and so on, are expressed via LLPC-specific function calls that act like intrinsics. They're not really LLVM intrinsics, but they act like them, and then, still as part of that driver, this gets lowered into proper LLVM IR that doesn't have these special intrinsics anymore and gets translated into a binary. You see that the backend part is shared between all the drivers, but the frontend has lots of different paths. The goal of transitioning to NIR is to have the picture look like this instead: to have RadeonSI take the path of going from GLSL IR to NIR, and then proceed like the RADV driver does. The NIR translation is of course already in place for Intel's OpenGL driver, and various Gallium drivers use it as well. So we get a bit more sharing. Why do we do this? One reason is to reduce code duplication. Another big reason, which you've heard about in Alejandro's talk this morning, is to support SPIR-V features in OpenGL as well; NIR is just the path of least resistance for doing that. Also, for some future features that we want to implement, like half floats and 16-bit integers, it's more convenient to represent them in NIR than to add them to the TGSI path. And NIR is actually a representation that you can do code transforms on. TGSI is really just a representation for transferring shaders; you can't easily transform it. With NIR you can, which allows more opportunities for controlling how optimizations are done in the shader backend, maybe some hardware-specific optimizations. What I mean by this may become clearer later in the talk. So I don't want to talk much about how this transition is happening. It's actually very far along already.
Thanks to a lot of hard work by some people, it's very close to feature parity with the default TGSI path. We've seen recently that the performance isn't quite up to par yet; we need to work on that. But we're quite far along, and if you have a recent Mesa master build, you can test it out by setting the environment variable that is mentioned there. Kudos definitely need to go out to Dave and Bas, who wrote the initial translation from NIR to LLVM as part of RADV. Then, after I did the initial work as part of my SPIR-V experimentation, Timothy has done a lot of work to get us close to feature parity. Samuel has also done some good work on the NIR backend. So thanks to all these people, we're getting very close. One question that comes up in this context is: what is the future of TGSI, this other, older shader representation, going to be? I don't think it's going to go away very quickly, because there are various niche places that generate TGSI. I think there are some in the multimedia code; I'm actually not sure right now. There are some helper libraries that generate TGSI shaders. There is the Nine project, the D3D9 implementation for Wine, which is TGSI-based. For all of these, the first thing we can do is just keep the TGSI path around longer; that's perfectly fine for now. The other thing we might consider at some point is to use the translation from TGSI to NIR, which already exists and might help there. TGSI is also currently used as a shader transport for virtualization drivers, both VMware's driver and the VirGL driver. I don't know what their plans for the future are. There is now a binary encoding of NIR for the disk cache, which should be suitable to fill that function of TGSI, so they might want to consider migrating at some point as well, but I really have no idea what their plans are, or whether they have even thought about it.
Given the lack of time, this was really everything I wanted to say about the NIR part of my talk. So if there are questions about that, maybe bring them up already now. Otherwise, I'll just continue. Okay. In the second part, I wanted to talk about dynamically linking shaders: what do I mean by that, and why do we want it, or why could it be useful? This is really an aspirational talk; there isn't code written yet. It's a goal to talk about and get feedback on. So what LLVM gives back to the driver is a standard ELF object that contains the shader binary. It contains some GPU-specific data sections, but mostly it's a standard ELF object. Right now, we just take the code part, the text section, out of it, and we actually just paste together the text sections of multiple shader parts. The goal would be that instead of this ad hoc pasting, we do a real linking step that can also take other sections into account and that can resolve relocations and things like that. The main motivation is that doing so would allow a better treatment of read-only data. Right now, if you have a constant in your shader, say just a scalar value, it will become an immediate as part of the instruction stream, like on x86. On x86 you have instructions with immediate constants, and the same is true on our GPUs. But if you have a larger constant structure, maybe some hard-coded lookup table that the shader uses, then in a normal program on a CPU that would land in the read-only data section. Since we don't have that yet, the only ways we can deal with it are either to translate it into uniforms, treating it like uniforms that happen to be unchangeable by the program, or to generate code that has all the values as immediate constants and builds the table on the fly while the shader is running. Neither of these solutions is particularly great.
So it would be good to have proper handling of these read-only data sections. The other aspect is that it would, or should, allow us to explicitly describe what the hell we're doing with LDS, the local data share. People who write compute shaders know about it; it's just maybe called differently there. I'll explain the details. Maybe you're asking yourself right now: why do we have multiple shader parts that we want to paste together in the first place? If you program OpenGL, you might be thinking of shader objects that are linked together into a program, but that's not it. In OpenGL you can have, for example, multiple vertex shader objects that are linked together into a single program, but that linking happens long before we ever convert anything to LLVM IR, so I'm not talking about that kind of linking. What I am talking about can be illustrated with a small example. This is one of the simplest possible pixel shaders, which I extracted by running glxgears. This is just the assembly that we output. I'm biased, of course, but I think that of all the desktop GPUs, we have the nicest assembly. To make sense of it, I should tell you a couple of things. First of all, our ISA is honest about the fact that multiple threads run simultaneously within what we call a wave. The program runs as a wave, and each wave, like a single-instruction-multiple-data machine, consists of 64 parallel threads. There are scalar instructions like the very first one; it starts with an S for scalar, and it operates on a single scalar value. It copies a single 32-bit value from S9, scalar register number nine, to the special register M0. Why it would do that is maybe a bit mysterious: S9 happens to be pre-initialized by the hardware (you can think of it as a shader ABI that's going on), and M0 is a special register that will be implicitly used by the next instructions.
The next instructions start with a V for vector, so they actually operate on up to 64 pixels at the same time. They are interpolation instructions, which is maybe a bit of a misnomer here, because no mathematical interpolation is going on: they just take a constant attribute value. You see the P0 after attribute 0.x, which basically says: take the x component of attribute zero and store its value in V0. And the same happens for y, z, and w. Then you have vector instructions that convert and pack pairs of 32-bit floating-point numbers into half floats, rounding toward zero (RTZ): taking values from V0, the first vector register, packing them together, and storing the result in V0; the one after that takes the values from V2 and V3 and packs them into V1. Then there is a special export instruction, which exports this color data to the color buffer, which then writes it into the render target, possibly performing blending. Finally, a scalar instruction that says "end program" marks the end of the shader. So there are these, well, some of my animations are gone for some reason. The main message here is: why are we doing this packing to 16-bit floats? We're doing it because we have an 8-bit color buffer and exporting that way is faster. But that depends on what the color buffer happens to be. The first part of the shader depends only on the original input shader, so the GLSL source or maybe some legacy OpenGL state. The bottom part, starting with the conversion instructions, is something we can only generate once we know what the color target is going to be, because if the program wants to render into a 32-bit floating-point buffer, it would be incorrect to pack into half floats; we would lose precision. So this last part we can only compile once we know in which context the shader is going to be used.
This leads to a problem called stuttering: maybe you have a game running, an object appears on the screen, the program had previously compiled the shader, but now the shader is used in a context it was not prepared for, so it needs to be recompiled, and whoop, your scene stops for a moment. One way to solve that is a disk cache, of course, but that only works once the program has run at least once. For the first run, what we can do is compile only this main shader part initially; the prologue and epilogue are very short, can be compiled quickly, and are just pasted on. So that's one reason why we combine different shader parts. There is another reason, and it has to do with how the shader stages work. If you know OpenGL or Vulkan or anything like that, you'll be familiar with the column on the left. It shows the shader stages in a graphics pipeline: it starts with a vertex shader; then, if you want to use tessellation, you have the optional tessellation control and evaluation shaders; if you want to use the geometry stage, you have an optional geometry shader; and in the end you have a pixel shader that produces the pixel color values. In our hardware, the stages actually look like the next column. At the bottom you see vertex shader and pixel shader, and above, geometry shader is familiar, but you also see these other names that are kind of weird. What happens is that each API shader gets mapped to the proper hardware stage, as illustrated by the next columns. In the simplest and standard case of just vertex and pixel shaders, the vertex shader goes to the hardware vertex stage and the pixel shader goes to the hardware pixel stage, as it always does.
But if you use a geometry shader, then the geometry shader goes to the hardware geometry stage, and since the hardware vertex stage comes after the hardware geometry stage, we have to change the order; the vertex shader ends up in the ES slot, and there is something called a copy shader, but the details are not important. A similar thing happens when tessellation is used, and in the most complex case, where both tessellation and geometry shaders are used, you get the column on the right. And then what happened is that the hardware designers said: in the tessellation case, between the vertex shader and the tessellation control shader there isn't actually that much fixed-function work going on, so let's just combine these into a single hardware stage, one that first runs the vertex shader and then runs the tessellation control shader. The same goes, in the second column, for the vertex shader and geometry shader: one single hardware stage, one single program from the GPU's perspective, runs both the vertex and geometry shaders. So we also have to paste these together. Okay. This leads to some interesting challenges. If you know a little about this stuff, you'll know that the vertex shader just outputs attributes for an individual vertex, while the geometry shader operates on one primitive at a time, so the input of the geometry shader will be an entire triangle with all the attributes of its vertices. That is the view you have as the programmer. If you think about how to translate that onto this single-instruction-multiple-data machine on a GPU, where everything runs in a single shader, you first work with vertex lanes, where every lane is responsible for computing one vertex shader invocation for one vertex. Then, in the second part, you have physically the same lanes, but now they operate logically on primitives, on geometry shader invocations.
And you somehow need to transfer the data between them. The way this transfer happens is that the vertex shader part stores its outputs into the local data share, a small memory that is shared between all the waves within one work group; typically we have up to four waves, processing up to 256 vertices at the same time. Then the geometry shader part loads its inputs from there. Now, this slide shows how the data is laid out in LDS, but the main problem here is that LLVM does not know how we're using LDS. This means we cannot use LDS for all sorts of things where it might be interesting, like spilling. Spilling doesn't have such a big application here, because for every 256 kilobytes of vector registers we only have 64 kilobytes of LDS, but still, there may be cases where it could help. We can't use it for dynamically indexed arrays, where it sometimes might be helpful. It's difficult to use LDS for additional purposes even from the frontend, because the frontend has to keep track of all the addresses manually, and it just becomes complicated. It might also inhibit analysis in some cases, although LLVM is generally very good at that. So the goal would be to explicitly represent all the variables that we use in LDS, represent them as an LDS segment in the ELF object that we get, and then, when we merge a vertex and a geometry shader, they will have a shared variable through which they transfer the attributes, but maybe they have some other uses for LDS as well, and we use the linker to arrange all of those and calculate the right addresses. It's not entirely simple, because if you look back, this thing here is kind of a two-dimensional array.
One index is the vertex number and the other index is the attribute number and component, and we don't really know either size when we compile at least the geometry shader part, because we might have a vertex shader that produces attributes that are unused by the geometry shader, and we don't know in advance how many waves we're going to run simultaneously. So there are some problems. But as a kind of minimal demonstration that might already be useful for various things, it would be nice if we could at least represent additional LDS variables. Maybe we want to use one to store some dynamically indexed array that happens to be the same across all lanes, something like that; maybe use it for spilling; maybe use it for something else; and do all of that in the linking. So that would be the goal. Remember, I also mentioned read-only data linking. That part is fairly straightforward in comparison, because it's just like on CPUs. We just need to think about what we want the ABI to look like: 64-bit pointers maybe, or maybe we want to restrict ourselves to 32-bit address spaces for a bit of efficiency. There are some choices to be made there. Of these two options, I think the second one is the better one, but that goes into details. So let me just summarize the two points. Switching to NIR in RadeonSI: we're going to do that, and it's actually very far along already. The other part is, well, aspirational: I want to explore this dynamic linking. I've explained the main purposes of it and what is involved. An interesting question there is what kind of linker we actually use. LLD, part of the LLVM project, is kind of a natural choice because we already depend on LLVM, although LLD does live in a different repository. It can be embedded as a library; it's designed like that. On the other hand, it's a complete ahead-of-time linker, and we only really need a dynamic linker. Well, this still needs to be explored, and we'll see.
Okay, with that, thank you for your attention.