Hi, my name is Nikolai. I work at AMD on our open-source OpenGL driver, where we use LLVM as the back end for shader compilation. I want to very briefly go over some unusual semantics that we need to model in LLVM IR because of the way parallelism works on GPUs.

Just as a reminder, how does it work? On the GPU, you have many of what we call compute units. You can think of those as being like cores on a traditional CPU, except that each of them can have many waves in flight at the same time. Waves are like threads on a usual CPU in the sense that they have one program counter, but they compute up to 64 items, which is what you would think of as threads when you're actually writing the source code for your shader or compute kernel. So in hardware, you actually have scalar registers, with one value per wave, and vector registers, with one value per item. In OpenCL and OpenGL, if you know them, work groups are assigned to compute units, but they might be split up into many waves.

So what are some of the examples that I want to mention? There are barrier instructions or calls; there are screen-space derivatives in pixel shaders; and there is the fact that texture lookups use descriptors that are stored in scalar registers.

The barrier instruction is a very simple instruction that says: please wait for all other waves in the same work group to reach the same barrier instruction in the program. If you go from the left code snippet to the right one, it looks like it would usually be an OK transform, but it isn't, because one wave might go down the true path while another goes down the false path, and then they're indefinitely stuck waiting for each other. So we need to tell LLVM not to do this or similar kinds of transforms, and we can do that by saying that the barrier intrinsic is a convergent function. The convergent function attribute has a formal definition; the key part is cited here.
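To make the deadlock concrete, here is a minimal Python sketch of the barrier problem. This is a toy model invented for illustration, not actual driver or LLVM code: it abstracts a work-group barrier as "every wave must reach the same barrier call site," and shows that sinking the barrier into one arm of the branch breaks that property when waves diverge.

```python
# Toy model: a work-group barrier completes only when every wave in
# the work group reaches the *same* barrier call site.
# (Hypothetical sketch for illustration, not real driver/LLVM code.)

def trace(program, wave_conds):
    """Run `program` once per wave and record which barrier call
    sites each wave reaches."""
    return [program(cond) for cond in wave_conds]

def deadlocks(traces):
    """The group deadlocks if the waves disagree on the sequence of
    barrier sites they reach."""
    return any(t != traces[0] for t in traces)

def original(cond):
    sites = ["B0"]          # barrier() before the branch: every wave hits it
    x = 1 if cond else 2
    return sites

def transformed(cond):
    sites = []
    if cond:
        sites.append("B1")  # barrier() sunk into the true path only
        x = 1
    else:
        x = 2               # this wave never reaches any barrier
    return sites

waves = [True, False]       # two waves take different paths
assert not deadlocks(trace(original, waves))     # all waves meet: OK
assert deadlocks(trace(transformed, waves))      # one wave waits forever
```

The same shape of argument applies to any transform that moves a convergent call into or out of divergent control flow, which is exactly what the attribute forbids.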
You can meditate on that definition, but basically it does exactly what we need.

The second example is screen-space derivatives. The typical use case for those is when you're sampling a texture in a pixel shader: you want to take the derivative of your texture coordinates to be able to select the level of detail at which you sample the texture. Again, when you go from left to right, it looks like a perfectly valid sinking transform, but it's actually not, because the way derivatives are approximated in practice is that you look at the corresponding value in the thread that computes a neighboring pixel. Now, if the neighboring pixel happens not to go down the true path, the coordinate is never computed for that fragment, and you get an undefined value. So we need to teach LLVM not to do this kind of transform either, and we can also do that by marking the texture intrinsic as a convergent function.

These two examples are covered in LLVM today. The last example I want to mention is one where we still have a gap. Again, look at the top two code examples, left and right. They look like they should be roughly equivalent, assuming there are no nasty side effects in the texture intrinsic. The problem, though, is that the sampler variable, which tells you which texture to sample from, is physically stored in the hardware in one of these scalar registers, which means that all the threads within a single wave must be sampling from the same texture. The code fragment on the right is therefore bad if the condition is not the same for all those threads. So again, the transform from left to right is one that we must forbid, but it is one that LLVM actually performs today, at least with tail sinking in SimplifyCFG.

It gets worse if you look at the bottom example, from left to right, which looks like it should be trivially equivalent, but it's not, because maybe if the condition is true, then the sampler is sampler 0.
And if the condition is false, then the sampler is sampler 1. Then the code fragment on the left works with the hardware, but the one on the right has the same problem: suddenly the sampler is not the same across all threads of a wave, and you get some kind of undefined result. So as I said, this is something where we still need to figure out how to solve it.

How are we going to do this? Since in the texturing example the sampler parameter is affected but the texture coordinate parameter is not, this is really something that wants to be modeled as a function parameter attribute. It's difficult to model as such, because it's a constraint that does not map very well onto the usual SSA semantics, but I think we have some language that is pretty good. The problem starts once you think about the bottom example: you have two code snippets that are equal, except that in one case you're calling f with parameter j, and in the other with parameter k. Intuitively speaking, you can imagine that calling f, which expects a convergent-labeled parameter, with j, which is itself labeled convergent, should be fine, whereas calling it with k, which is itself not labeled convergent, might become a problem if some transform is applied to a function that is then calling g. This kind of talk is not the right place to go into the details of that. There is a link at the bottom to an open Phabricator review where there was a discussion on this, and I hope to continue it in the next couple of weeks to hopefully get this resolved.

OK, thanks. And I guess we have some time for questions.

Yeah, a question, but it's to do with how the reviews go. OpenCL is quite prominent in LLVM, and I'm surprised you're tripping on so many things. I mean, this should be pretty straightforward for the OpenCL people. So are you finding that they are resisting, or?

So the question is: this should be a pretty common thing in OpenCL.
So how does the review go? I don't find resistance on that end. One thing to mention is that texturing in OpenCL is maybe not that common, and even in OpenGL it took us a long time until we ran into a case where this was a problem, because usually you don't have code where the optimizer then breaks things. The issue is really more that it's a constraint that does not map at all well onto SSA, because you have this intermixture of control flow and data flow, and we're trying to get it right. Exactly, we don't want to force global value numbering, for example, to have to deal with this, and that makes it challenging. And then I had to put it on the back burner for some time. But yeah. OK.