Hello, everyone. Welcome to my first open source graphics class. I'm Boris Brezillon. I've been working at Collabora for nine months now. And today, I'm going to teach you how GPUs work. Actually, no. I'm not in a position where I can pretend to teach you how GPUs work, because I only started working on GPUs like six months ago. What I'll do instead is share my understanding of these things. And of course, if I'm wrong, feel free to interrupt me and correct me. Also, don't blindly trust everything that's written in those slides. If you have to use any of it, go check on your own and make sure it's correct.

So what is this talk about? Maybe I should first start by telling you what it's not about. It's definitely not about teaching you how to use a graphics API. So no OpenGL, no Vulkan in there. It's also not about teaching you how to develop a driver. Both of those things would take much more than 40 minutes. What I'm trying to do here is explain how GPUs work, how to interact with the GPU from the software stack, and what the Linux GPU stack looks like.

But before we look at the software stack, let's first have a look at what a GPU does. Basically, a GPU is here to display 3D content. You pass all the things to the GPU, like the models you want to display in the scene, the textures you want to apply to those models, and some extra information like the transformations you want to apply to those models. Then you pass all that to the GPU and you get a nice textured cube. But in the middle, it's kind of a black box. Actually, it's a blue box here, but that's the same thing.

So let's have a look at what's inside this black box. We actually have two main stages. The first one is the geometry stage. It's where the coordinates you pass to the GPU are transformed into a geometric shape and split into basic primitives. Here, you can see that we chose triangles, but it can be quads or lines or whatever. Then this geometric shape is passed to the rasterizer stage, which is also passed one or multiple textures. Those textures are applied to your geometric shape, and you end up with some 3D content you can display on the screen. Of course, this one is pretty basic. When you look at a game, you actually have hundreds of models with complex textures, and those models are much, much more complicated than cubes.

So yeah, that's what a GPU does. Let's have a closer look at both the geometry stage and the rasterizer stage. The geometry stage is here to turn those vertices, those coordinates, into the geometric shape you want to render. The first step is passing those vertices to what we call a vertex shader. It's a part of the pipeline which is programmable, so the user is able to define how those vertices get transformed. You have some basic transformations which happen almost all the time, like placing the model in some complex scene, placing the camera somewhere in the scene, or doing some kind of projection. And then you have things that are completely application-dependent. You can do any kind of transformation on those vertices that you want, because it's completely programmable.

Then you have two stages which are optional, so depending on your GPU they will be supported or not. Those are the geometry and tessellation shaders. Actually, I didn't put any diagram for those, because I don't really understand how they work exactly.
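To make the vertex shader step concrete, here is a minimal C sketch of the transformation almost every vertex goes through: multiplying its position by a combined model-view-projection matrix. The types and function names here are illustrative, not from the talk or any real API.

```c
#include <stdio.h>

/* A 4x4 matrix in row-major order and a homogeneous vertex position.
 * Both types are illustrative; real drivers and shaders use their own. */
typedef struct { float m[4][4]; } mat4;
typedef struct { float x, y, z, w; } vec4;

/* What a vertex shader fundamentally does for each vertex:
 * out = MVP * in, where MVP combines model, view and projection. */
static vec4 transform(const mat4 *mvp, vec4 v)
{
    float in[4] = { v.x, v.y, v.z, v.w }, out[4];
    for (int row = 0; row < 4; row++) {
        out[row] = 0.0f;
        for (int col = 0; col < 4; col++)
            out[row] += mvp->m[row][col] * in[col];
    }
    return (vec4){ out[0], out[1], out[2], out[3] };
}

int main(void)
{
    /* Identity "MVP": a stand-in for model * view * projection. */
    mat4 mvp = {{ {1,0,0,0}, {0,1,0,0}, {0,0,1,0}, {0,0,0,1} }};
    vec4 v = transform(&mvp, (vec4){ 1.0f, 2.0f, 3.0f, 1.0f });
    printf("%.1f %.1f %.1f %.1f\n", v.x, v.y, v.z, v.w);
    return 0;
}
```

The same multiply runs independently for every vertex, which is exactly why this stage parallelizes so well on GPU hardware.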
But the idea is that, based on a set of vertices, you are able to generate new ones. So instead of passing thousands or millions of vertices, you can generate them from the ones you already passed. It saves some memory bandwidth when you use those tricks.

Then we go through some fixed-function steps. The next one, after the geometry and tessellation shaders, is primitive assembly. It's about taking all the coordinates and linking them so that you can form basic primitives. Here, we are creating triangles, and so for a cube you have six faces, 12 triangles. The next step is clipping. The object you want to render might not be completely inside the scene, so you have to clip it against the sides of the view volume. And so this step might generate new vertices, and thus new triangles. Then you go through back-face culling. This part is about optimizing the rendering: everything that's not visible should be hidden, and so during this phase you drop some of the shapes, some of the basic primitives. If you look at this diagram, you'll see that all the faces we don't see are just discarded. And finally, you have to place this shape inside the window you want to display it in. So you scale the shape and then move it to where you want it displayed. That's how you end up with your geometric shape.

Then the GPU passes this shape to the next stage, which is called the rasterizer stage. Here it's all about filling pixels with colors. The first step is determining which pixels are covered by each of those simple primitives. So you pick one triangle, put it on the pixel grid, and determine which pixels are covered. Once you have done that, the triangle setup, you try to fill all of those pixels with colors. Those colors are usually taken from a texture, but it can also be a predefined color. And finally, once you have done that, you look at what the previous pixel value was, and then determine what the final pixel value will be. If you add a triangle on top of another triangle, and the new triangle is above the old one, then it completely hides the old pixel. Or if you determine that the triangle is partly transparent, then you have to do some blending. The basic idea is that the merging stage is here to determine the final color that will be displayed on your screen.

So that's basically it. You have this complex display pipeline, and you end up with something like what we've seen on the first slide.

Great. Now let's look at what's actually inside a GPU, hardware-wise I mean. We've seen that in the pipeline we have some fixed functions and some programmable functions. The fixed functions are all in red in this diagram: the texture units, the triangle setup, the rasterizers, the blending units, and a lot more. Basically, everything that takes a lot of time is optimized and put in a fixed function. And then you have the programmable parts, which are called shader cores. Inside those shader cores, you actually have one or multiple ALUs, and those are where all the processing happens inside the GPU. Next to those cores, you also have some blocks that are here to optimize things, like caches, because you want to hide the memory latency, or a scheduler, because you want to parallelize a lot of things but still not add too many cores.
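Here is a minimal C sketch, with illustrative names, of two rasterizer ideas just described: an edge-function test to decide whether a pixel center is covered by a triangle, and alpha blending to merge a new fragment with the pixel already in the framebuffer.

```c
/* Signed area of the parallelogram spanned by edge a->b and point p;
 * its sign tells which side of the edge p lies on. Illustrative helper. */
static float edge(float ax, float ay, float bx, float by,
                  float px, float py)
{
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
}

/* Coverage test: p is inside the triangle if it sits on the same side
 * of all three edges (counter-clockwise winding assumed here). */
static int covered(const float v[3][2], float px, float py)
{
    return edge(v[0][0], v[0][1], v[1][0], v[1][1], px, py) >= 0.0f &&
           edge(v[1][0], v[1][1], v[2][0], v[2][1], px, py) >= 0.0f &&
           edge(v[2][0], v[2][1], v[0][0], v[0][1], px, py) >= 0.0f;
}

/* Merging stage: classic "source over destination" alpha blending,
 * applied per color channel; alpha is the new fragment's opacity. */
static float blend(float src, float dst, float alpha)
{
    return src * alpha + dst * (1.0f - alpha);
}
```

As a side note, the same signed-area computation, applied to a whole triangle in screen space, is what back-face culling can use: a negative area means the triangle winds the wrong way and faces away from the camera.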
But the basic idea behind GPU processing is going massively parallel. And why do we want that? Because when you think about it, you have a lot of vertices. When you think about a game, you have a lot of models. A scene can contain like a million models, and each model contains a lot of vertices. And then when you want to display the final frame, think about the resolutions, like 1080p or even 4K now. When you think about it, you just have a huge amount of things to process. And luckily, all of those things can be processed completely independently. So it really calls for parallelization. And the last thing is that we want things to be rendered in a sensible amount of time, so we can't just serialize things; we have to do things as fast as possible.

So how do we do that? The first trick is using SIMD. SIMD means you take a single instruction and apply it to multiple data. In the case of GPUs, that means you take a single instruction and apply it to multiple vertices or multiple fragments. Another way to optimize things is to use fixed-function units, and we've seen that we have a lot of them in GPUs. And the last thing is putting in a lot of cores. But when you put in a lot of cores, you somehow have to reduce the size of each core. And if you want to reduce the size of each core, they have to be as dumb as possible. So no fancy stuff like we find in CPUs: out-of-order execution, smart prefetching, branch predictors, we don't have that in GPUs. Or at least not as smart as what we have in CPUs.

So yeah, we have a solution: SIMD, a lot of cores, and that's perfect. Well, actually it's not, because it doesn't work in practice. Here are just two problems that we have, and I guess there are many more. The first one is that when you want to access memory, you have to go through the memory bus, and that usually takes like 100 cycles or even 1,000 cycles. During that time, you can't execute anything else; you have to wait. So we somehow have to hide the latency incurred by all memory accesses. We have a solution for that in CPUs: a lot of caches. And that works great, because when you want to access some piece of memory, you access a bit more, put it in the cache, and the next access which is close to it just hits the cache, so you get the memory almost instantly. But doing that in GPUs means you have to add caches in all the cores, and then L2 caches and so on. And as I said, we try not to use too much space, too many gates. So the other approach is to instead use multithreading. That is, you prepare a lot of things to do, pack all the fragments, pack all the vertices, and keep them on hold. When one of the threads being executed gets blocked because it needs to do a memory access, the GPU just puts this thread on hold and picks another one to execute. And since you have a lot of them waiting in parallel, you can somehow hide this latency, because you have a lot of things to do in the meantime.

So that works great. The other approach that helps optimize the parallelization is SIMD. But with SIMD, if you want to use the GPU efficiently, you have to keep all the ALUs busy as much as possible. And that means that when you execute a shader, you have to avoid divergent branches.
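As a tiny illustration of the SIMD idea, here is a C sketch using the x86 SSE intrinsics from `<immintrin.h>`: one instruction, `_mm_add_ps`, adds four floats at once, the same way a GPU applies one instruction to a whole group of vertices or fragments (GPUs just use much wider groups).

```c
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    /* Four vertex x-coordinates and four per-vertex offsets. */
    float xs[4]  = { 1.0f, 2.0f, 3.0f, 4.0f };
    float off[4] = { 0.5f, 0.5f, 0.5f, 0.5f };
    float out[4];

    /* One instruction, four data elements: this is SIMD. */
    __m128 a = _mm_loadu_ps(xs);
    __m128 b = _mm_loadu_ps(off);
    _mm_storeu_ps(out, _mm_add_ps(a, b));

    printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```

This also shows why divergence hurts: if the four lanes needed to take different branches, a SIMD machine would have to execute both paths with part of the lanes masked off, leaving ALUs idle.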
Because if some of the fragments or vertices go into one branch and the others go into the other branch, that means part of the ALUs are idle for some time. So you have to find a way to pack all those things so that you get good use of your GPU. Those are some of the complexities you have to deal with. The hardware can help with that, but you also have to deal with it from a software point of view. And that's why GPU compilers are so complex: you have to think about all those things when you want to optimize code for a GPU.

So that's it for the hardware. Now let's have a look at how you actually interact with this hardware from the software point of view. The CPU is here to deal with all the apps running on the machine, but it's also here to ask the GPU to do something. And when it asks the GPU to do something, it has to pass a lot of data: think about the vertices, think about all the textures, all the stuff that actually represents a huge amount of data. So how do we do that? Well, the simplest way is to just put everything in memory, then tell the GPU: OK, everything you need to execute this is in memory at this position; go do what you have to do, and let me know when you're done. And that's basically how it works. What you have is what we usually call a command stream. This command stream contains a few blocks, each block describing a specific operation. For example, if you have a vertex shader, you usually have a vertex job descriptor, which points to another block of memory containing all the vertices, and to another block containing the bytecode for this shader. You have the same thing for the fragment job, and you also have other kinds of jobs which I won't describe here, because they're highly GPU-dependent. So yeah, the driver basically has to describe all that and pass this piece of memory to the GPU.

Great. So we have a good idea of how we interact with the GPU, what's inside the GPU, and so on. Now let's have a look at the graphics stack. When you look at the graphics stack, you actually have several components. The first one is the application. This application interacts with a well-known graphics API. You have several of them: OpenGL is probably the most well-known one, you have Direct3D for Windows applications, and you also have Vulkan, which is quite new and will probably take over from OpenGL at some point. And then behind those APIs, you actually have drivers. Those drivers are split into two pieces: one running in user space and one running in kernel space.

But let's first have a look at the graphics API. Why do we need that? When you look back at the graphics pipeline and at the command stream stuff, you can imagine how complex it is to actually get it right for a specific piece of hardware. So what those graphics APIs do is abstract away the hardware specificities. And you might have a choice of APIs; it really depends on your hardware. If you use Intel GPUs, you can use OpenGL, Vulkan, and even Direct3D now, thanks to the Mesa drivers.

And as I said before, part of the pipeline is programmable, so you somehow have to describe what you do in those shaders. This part is actually done in a separate language, which is called GLSL or HLSL. And when the user space application wants to pass this program to the GPU, it actually has to pass it first to the graphics API. The driver has to compile it into some hardware-specific bytecode, and then it can pass it to the GPU through the command stream.
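As a concrete example of that last step, here is the standard OpenGL sequence an application uses to hand a GLSL shader to the driver, which then compiles it to GPU-specific bytecode behind the scenes. These are real OpenGL calls; error handling is trimmed to the minimum, and on Linux these entry points are usually obtained through a loader such as glad or GLEW, which is omitted here.

```c
#include <GL/gl.h>

/* Hand a GLSL vertex shader to the driver; the driver's user space
 * compiler turns this source into GPU-specific bytecode. */
static GLuint compile_vertex_shader(void)
{
    const char *src =
        "#version 330 core\n"
        "layout(location = 0) in vec3 pos;\n"
        "uniform mat4 mvp;\n"
        "void main() { gl_Position = mvp * vec4(pos, 1.0); }\n";

    GLuint shader = glCreateShader(GL_VERTEX_SHADER);
    glShaderSource(shader, 1, &src, NULL);
    glCompileShader(shader);   /* compilation happens inside the driver */

    GLint ok = GL_FALSE;
    glGetShaderiv(shader, GL_COMPILE_STATUS, &ok);
    return ok ? shader : 0;
}
```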
A few words about the main APIs in the open source space. We basically have the choice between OpenGL, which is the old API, and Vulkan. They take two different approaches. OpenGL is about hiding as much as possible the complexity of what's happening in the GPU. And that works pretty well: when you want to develop a 3D app, you do it in OpenGL and it's rather simple. On the other hand, that means the driver has to do some guessing about what the user wants to do if it wants to optimize things. Vulkan instead tries to expose as much as possible of the hardware complexity, so that the user can take the best decisions for their specific workload. Of course, when you write a Vulkan application, it's likely to be a bit more complex. But on the other hand, it's likely to be much more efficient, because you get full control over the GPU pipeline. So yeah, if you have the choice, I guess it's a pretty good idea to try writing your application in Vulkan. There are two reasons to do that. The first one is that it will probably work better than an OpenGL app. And the second one is that you will actually learn how GPUs work internally. And it's a good thing to understand how things work, because then you can take the appropriate decisions.

So now let's move on to the drivers. The drivers are quite complex, and that has to do with the GPU complexity, actually. We could have put the drivers entirely inside the kernel. But since they are complex, and not all of the code actually needs to run in a privileged context, it's better to move the part that can live in user space into user space, and have the part that deals with hardware interaction in kernel space. Also, debugging in user space is much, much easier. And the last thing, though I'm not sure it's a good reason, is that when you put something in user space, you don't have all those licensing issues. That's probably a good reason for all the closed source drivers to be designed like that.

So let's have a look at what's inside a kernel driver. The kernel driver is responsible for three things. Memory management: allocating buffers, freeing buffers, passing buffers around, and so on. It's also responsible for taking a command stream and executing it, and for doing the multiplexing when you have several applications accessing the GPU; so it's also here to schedule all the command streams it has received. And finally, it's here to signal when the GPU is done executing a command stream. For all open source user space drivers, you have a kernel space driver which is normally in mainline, usually under drivers/gpu/drm. For closed source drivers, we don't have kernel drivers in mainline, simply because of a policy that the DRM maintainers want to enforce: they want user space drivers to be open source before merging the kernel space drivers. And that actually makes sense, because we want to push for open source solutions.

So that was the kernel space driver. What about the user space driver? That one is the most complex one. I mean, it's here to implement the graphics API. It's here to compile shaders, so all the compiler complexity is put in user space. It's also here to create the command streams, pass them to the kernel, and do the synchronization with the kernel space driver. And it's also here to interface with the windowing system; so all the interaction between your Wayland compositor and your GPU is also done in the user space driver.
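To give an idea of what "passing a command stream to the kernel" looks like, here is a hedged C sketch of a user space driver talking to a DRM kernel driver through an ioctl. The render node path is real, but the `DRM_IOCTL_MYGPU_SUBMIT` ioctl and its struct are made up for illustration, since the actual submit interface is different for every GPU driver (panfrost, msm, i915, and so on each define their own).

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical submit ioctl. DRM driver ioctls really do use the 'd'
 * magic and start at 0x40, but this exact request is illustrative. */
struct mygpu_submit {
    uint64_t cmdstream_addr;  /* GPU address of the command stream */
    uint32_t cmdstream_size;  /* size of the command stream, bytes */
    uint32_t out_fence;       /* fence to wait on for completion   */
};
#define DRM_IOCTL_MYGPU_SUBMIT _IOWR('d', 0x40, struct mygpu_submit)

int submit(uint64_t gpu_addr, uint32_t size)
{
    /* Render nodes really live here; the minor number may vary. */
    int fd = open("/dev/dri/renderD128", O_RDWR);
    if (fd < 0)
        return -1;

    struct mygpu_submit req = {
        .cmdstream_addr = gpu_addr,
        .cmdstream_size = size,
    };
    /* The kernel driver queues the job, schedules it among all the
     * applications using the GPU, and signals the fence when done. */
    int ret = ioctl(fd, DRM_IOCTL_MYGPU_SUBMIT, &req);
    close(fd);
    return ret ? ret : (int)req.out_fence;
}
```

This maps directly onto the three kernel driver responsibilities above: the buffers referenced by `cmdstream_addr` come from its memory management, the ioctl feeds its scheduler, and the fence is how it signals completion.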
So we have a solution for open source drivers in Linux, which is called Mesa. And depending on the API you want to implement in this library, in this middleware, you have two different approaches. The one for GL is to abstract the GL calls, define a standard driver interface, and then pass all the calls to the specific driver. For Vulkan, it actually doesn't work like that, because Khronos has its own driver loader and its own driver abstraction layer. So all that's implemented in Mesa for Vulkan drivers is actually just the drivers; nothing is abstracted for the API itself. For code sharing, they just use libs, a few helpers which help factorize some code, but there is no abstraction layer.

Regarding the implementation of GL drivers, you actually have several choices. I won't detail the pre-Gallium one, because it's not supposed to be used anymore, but there was a specific interface which drivers were implementing before Gallium. Gallium is about abstracting all the state machine specificities and making the driver deal only with hardware specificities. What the Gallium abstraction does is take calls from a specific API, which can be OpenGL, Direct3D, or one of the other state trackers, transform them into some kind of generic calls, and then pass those calls to the specific driver. Depending on your hardware, it goes through a different driver. So if you have a GL call and a Mali GPU on your platform, it will go to the Panfrost driver, which passes it on to its kernel driver counterpart. And that's how the abstraction is done inside Mesa.

For Vulkan drivers, it's a bit different. You actually have no abstraction at all, which means that currently drivers are kind of duplicating a lot of code. Even if part of it is put in some common libraries, most of the code, I think, could be shared a bit more. But the idea is that the layer between the graphics API and the actual driver is much, much thinner than with Gallium. Part of the reason it's done like that is that you don't have a state machine anymore in Vulkan, so you don't have to deal with all the state tracking that's done inside Gallium. Or at least that's my understanding.

So yeah, that was a pretty quick overview. I won't pretend you will be able to develop drivers based on that, but it gives a pretty good idea of where to look and how the pieces interact with each other. Keep in mind that the GPU topic is super vast, and you won't be able to get a grasp on all those concepts in like two weeks; at least for me, it's not possible. So if you want to start developing a GPU driver, what you should do instead is focus on a specific feature or a specific bug and go digging in that direction. And keep digging until you actually understand the underlying concepts. And the most important thing is: don't give up. Keep reading stuff about GPUs and keep learning.

A few useful resources if you want to learn about GPUs. There are quite a few good blog posts about how GPUs work, and of course you can search on Google and you'll find plenty of them. Also, if you want to dig into the Mesa source code, you should probably have a look at the documentation first, because the source tree is a bit hard to follow. And the DRM subsystem, the kernel side of things, is pretty well documented as well, so you should probably have a look at that too. And that's all for now.
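To make the Gallium idea from the overview more tangible, here is a heavily simplified C sketch of what a "generic calls" driver interface can look like: a table of function pointers the state tracker calls into, which each hardware driver fills in. The struct below is purely illustrative; the real interface is `struct pipe_context` in Mesa, which is far richer.

```c
#include <stdint.h>

/* Illustrative, cut-down version of a Gallium-style driver interface.
 * The state tracker (GL, D3D, ...) only ever talks to this struct;
 * each driver (panfrost, freedreno, ...) provides the callbacks. */
struct gpu_context {
    void *hw;   /* driver-private hardware state */

    void (*bind_shader)(struct gpu_context *ctx, const void *bytecode,
                        unsigned size);
    void (*set_vertex_buffer)(struct gpu_context *ctx, uint64_t gpu_addr,
                              unsigned stride);
    void (*draw)(struct gpu_context *ctx, unsigned first_vertex,
                 unsigned vertex_count);
};

/* The state tracker resolves an API draw call into generic calls: */
static void state_tracker_draw(struct gpu_context *ctx)
{
    ctx->draw(ctx, 0, 36); /* e.g. a cube: 12 triangles, 36 vertices */
}
```

The design point is that the state tracker is written once against this interface, and only the callback implementations differ per GPU.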
I hope you were able to understand a few things. And if you have any questions, maybe I can answer them.

Q: Is anyone working on a PowerVR open source / free software driver? Because I know that all the other GPUs now have a free software driver, except the PowerVR.

A: I'm pretty sure no one is working on that, unfortunately. And I heard that the architecture of the GPU itself is quite hard to deal with, so it doesn't really fit in the model we have right now, the model we have right now in Mesa. So no.

Q: Hi there. I was wondering, when you want to do general purpose computation on a GPU, do you know if there are any significant changes to the hardware or architecture that need to happen to make that possible? Or is it just a case of using the graphics pipeline in a more non-graphical context?

A: I guess the compiler can help optimize things for your specific workload, so it has to do with the driver side of things. The Mesa driver will try to optimize some specific workloads; I guess you're talking about OpenCL or this sort of thing, general purpose GPU. It's the responsibility of the driver to try to optimize that. And you have some tools to help you with that, especially in Mesa: you have the NIR compiler, which helps a lot with all kinds of compiler optimizations. So yeah, you have some tools.

Q: What's the OpenCL support in Mesa?

A: I'm not so sure about that, but I think Clover is about supporting OpenCL in Mesa. So I guess there's a state tracker dedicated to OpenCL inside Mesa. Not entirely sure, but I think this is the case.

Q: Is Vulkan only working with one specific driver, on slide 32? Sorry, 31. The Vulkan one, is that the MSM driver?

A: It's the Qualcomm one. Freedreno is the open source user space driver, and MSM is the kernel side of the open source driver.

Q: Can you shortly explain what a state tracker does? What is it?

A: A state tracker is about maintaining the state of the graphics API. If you look at how OpenGL works, it's actually a huge state machine, and every time you do an operation, it changes the state. The state tracker is here to help you keep track of that, so that drivers only focus on executing things and nothing else. And so you can have the same state tracker for all drivers. That's actually what Gallium is designed for: factorizing all that state tracking and making it common to all GPU drivers.

Thanks. OK. Thank you.
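To illustrate the state machine that last answer refers to: in OpenGL, calls like the ones below don't draw anything by themselves, they just mutate global context state that later draw calls consume, and that is exactly the state a Gallium state tracker keeps track of on behalf of the drivers. These are real OpenGL calls.

```c
#include <GL/gl.h>

/* None of these calls render anything; each one just flips a piece of
 * the GL context's state machine. The next draw call reads all of it. */
static void setup_state(void)
{
    glEnable(GL_DEPTH_TEST);                           /* state: depth test on */
    glEnable(GL_BLEND);                                /* state: blending on   */
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA); /* state: blend factors */
    glViewport(0, 0, 1920, 1080);                      /* state: viewport      */
}
```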