Hello everyone, welcome to my talk. I hope everybody's not too tired after lunch, so let's see. My name is Lucas Stach. I'm a graphics developer at Pengutronix, specializing mostly in display output and GPU stuff, mostly on embedded devices, which is why I'm here at the embedded conference. And I'm going to talk about why GPUs are maybe not as fast as you expect. Everyone is telling us that GPUs are those magic go-fast devices. Everyone calls them accelerators that make things go fast. But they're not always performing up to this expectation. Why is that? It turns out GPUs aren't magical. They just solve a very fundamental problem of computing in a quite different way than, for example, CPUs do. So let's dive in.

The fundamental problem of computing is that DRAM is not scaling as fast as compute. If you look at this chart, compute scales linearly on the log scale. DRAM is scaling OK in capacity, much worse in bandwidth, but access time to DRAM has been essentially flat since forever. Accessing your DRAM arrays is not going to get much faster with new technologies. For example, on a modern mid-range SoC you might get DDR4 RAM at 4,000 megatransfers per second on a 32-bit bus, with maybe 20 nanoseconds of access time; depending on the access pattern that might be worse or better, but it's around that number. That works out to 16 gigabytes per second of bandwidth. And with a bandwidth-delay product of 20 nanoseconds times 16 gigabytes per second, you need 320 bytes of memory transactions in flight at any point in time to actually keep the DRAM bus busy.

In a very naive example, an access to DRAM looks somewhat like this: you work out which part of the RAM you want to access, you kick off that access and fetch a few things from memory, and then you wait for the data to come back from DRAM. So how do you improve on that? The obvious solution in computing is adding caches. You fetch more in a single burst and hope that your workload has some spatial locality, so you can reuse the block you just fetched and get better utilization on the DRAM side. But still, after you've worked out what to access, you have to bring it into your caches, and then you have to figure out what to do from there. So how do we fill those gaps between the actual DRAM accesses? If you look at modern CPUs, the answer is speculation and prefetching. Your CPU runs ahead of the memory and tries to guess what it has to do next, learning your memory access patterns over time and trying to bring data into the pipeline or the CPU caches before it's actually needed. But that's a lot of hardware you have to invest: all this tracking of access patterns, working out where your branches are going, and large caches to actually make it effective. That's one way to spend your hardware budget.

But it turns out there is another way to do it, and that's what GPUs are doing. If you imagine a typical application for a GPU, it's filling your screen pixels with a certain color, where the GPU has to work out which color to put into each pixel. And if you have this grid of pixels, you'll see that this is a really easy task to parallelize.
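As a small aside before we get to how GPUs exploit that parallelism: here is the bandwidth-delay arithmetic from a moment ago as a minimal runnable sketch. The DDR4 figures are just the example numbers from above, not measurements of any particular system.

```c
#include <stdio.h>

int main(void)
{
    /* Example figures from the talk, not measurements of a real system. */
    double access_time_s = 20e-9;  /* ~20 ns DRAM access time              */
    double bandwidth_bps = 16e9;   /* 4000 MT/s on a 32-bit bus ~= 16 GB/s */

    /* Bandwidth-delay product: bytes that must be in flight at any time
     * to keep the DRAM bus busy while earlier requests are still pending. */
    double bytes_in_flight = access_time_s * bandwidth_bps;

    printf("bytes in flight needed: %.0f\n", bytes_in_flight); /* ~320 */
    return 0;
}
```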
So you can do all the pixels, or a subset of the pixels, in parallel. The actual latency is only defined by how fast you can fill the whole grid, how fast you can fill the screen with colors. It doesn't matter how long a single pixel in your grid takes; what matters is how fast you can fill the whole grid. So what GPUs are doing is adding less sophisticated execution engines that don't do all the speculation and don't do much prefetching. OK, modern big GPUs also do a lot of prefetching, but at the simplest level a GPU just adds more compute cores, each taking care of individual pixels in this grid. So you get by with less sophisticated hardware, but you add more of it. And if all of those are working in parallel, you get a nice stream of processing units requesting things from memory, getting data back and doing the work. There's no speculation involved; all the execution engines know exactly what data they want from memory, so you're not wasting any bandwidth on mis-speculation or mis-prefetching.

But then you'll see that your execution engines are still waiting for data to arrive from memory. So that's wasted hardware again. If you add four execution engines that can do four of those pixels in the grid at the same time, they will spend most of their time waiting for data to come back from memory. So what can you do to reduce this waste of hardware that's just sitting idle waiting for the memory bus? You make your register file much bigger and share the actual execution engines. Basically you're doing barrel processing, with the first pixel being worked on with this part of the register file and the second pixel being worked on with that part of the register file. Each instance working on a pixel can go off, do some computation, work out what it needs to fetch from memory and kick off those memory requests, and then the execution engine just switches the register range to the next thread, works out where to access memory from there and kicks off that memory access. So you're sharing a single computation engine across multiple threads.

That way you can run a lot more threads just by replicating your registers, but not replicating your actual execution engines, which is good: you're reducing hardware waste. But there's also an obvious downside here. You have a single shared register file, so if each of those instances filling a single pixel uses more registers, you can have fewer threads in flight. And with a lower number of threads in flight, you're not able to keep the memory pipeline busy with requests, so there are fewer opportunities to hide the memory latency.

What I really want you to take away from this is that GPUs aren't optimized for latency on a single execution thread; they're optimized for really parallel workloads. By design they're really good at hiding memory latency and at not wasting memory bandwidth on mis-speculation. But the obvious downside is that this all crashes down if a single instance of your computation is too complex, because then a lot fewer threads can be in flight on the GPU at the same time. Or if the problem itself isn't really a parallel problem. Filling a screen of pixels is an obviously parallel problem.
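To put a rough number on that register-file trade-off, here is a hedged back-of-the-envelope sketch; the register file size and per-thread register counts are made-up example values, not figures for any particular GPU.

```c
#include <stdio.h>

int main(void)
{
    /* Purely hypothetical numbers for illustration; no real GPU implied. */
    const unsigned regfile_bytes = 64 * 1024;  /* shared register file */
    const unsigned bytes_per_reg = 4;          /* 32-bit registers     */

    const unsigned regs_light = 16;   /* a simple pixel shader          */
    const unsigned regs_heavy = 128;  /* a register-hungry one          */

    unsigned threads_light = regfile_bytes / (regs_light * bytes_per_reg);
    unsigned threads_heavy = regfile_bytes / (regs_heavy * bytes_per_reg);

    /* Fewer resident threads means fewer memory requests in flight,
     * so less opportunity to hide the DRAM latency. */
    printf("light shader: %u threads in flight\n", threads_light);  /* 1024 */
    printf("heavy shader: %u threads in flight\n", threads_heavy);  /*  128 */
    return 0;
}
```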
But if you're trying to do something else on a GPU, it really depends on whether you can decompose your problem into a parallel problem for the GPU to perform well. So much for the hardware; I'm not going to annoy you with all the nitty-gritty hardware details for the rest of the talk.

Moving up from the hardware, we have the GPU driver sitting on top of it. GPU drivers in the Linux world are generally split between a kernel mode driver and a user mode driver. And as you can see from the size of those boxes, the kernel mode driver is typically much smaller than the user space driver, because you have all those big APIs sitting on top of it to actually exercise those acceleration capabilities from the application side. The obvious one is OpenGL for rendering graphics; the newer one on the block is Vulkan; and if you're doing GPU compute, you're probably going to use OpenCL. There's a really, really big driver up in user space that takes these really complex APIs meant to interface with the application and breaks them down to a level where the hardware understands the commands. That's all integrated in the Mesa library on Linux, at least if you're talking about the open source drivers; other vendors provide closed source user space libraries that do basically the same thing. And then you have, hopefully, a stable UAPI in the Linux world and a small kernel driver, which talks to the hardware and hooks into the Linux memory management, making sure that the memory you need to buffer your commands and the data the GPU is operating on is available to the GPU when it needs it. So it's doing memory management for the GPU and things like that, but that's typically a much smaller layer than what you have in user space.

Each time you have to traverse this user/kernel boundary, that's quite costly. Your application might want to draw a lot of primitives to fill the screen: draw a lot of triangles, do a lot of pixel fills, whatever. You don't want your user space driver to traverse this boundary for every single draw, which might only fill one or two pixels on the screen. So the user mode drivers typically batch up those commands into large batch buffers that are then sent down to the kernel driver to actually execute on the hardware. It's basically a classic example of trading latency for throughput: you batch things up to reduce the cost of a single operation. It is possible to reduce this latency and make the batches smaller if you need to; all the graphics APIs provide ways to tell the driver, now actually make sure this thing executes on the GPU in the smallest amount of time. But obviously that breaks your batching opportunities and drives up the cost you have to pay on the CPU side that's preparing all the commands for the GPU, and you might have problems filling your GPU pipeline this way. So be wary of that.

Additionally, GPU drivers tend to optimize quite heavily for throughput, again trading off latency, by allowing the CPU that's preparing the commands to get ahead of the GPU. In a simple example, the GPU is working on a job while the CPU is at the same time preparing the next job for the GPU to execute. The CPU is running ahead of the GPU, so when the GPU finishes the previous job there are already commands prepared for it to pick up and work on next.
So there's a pipeline again, and you'll typically see the CPU running ahead by a small single-digit number of frames or so. But from the application perspective, you might want to read back results from what the GPU did. If you're just putting pixels on the screen you probably don't want to do that, but there are application use cases where you want the data from the GPU back, and obviously that introduces bubbles into this nice pipeline we've seen before. If this job depends on a result of the previous job, the CPU has to wait for the GPU to catch up and actually provide that result before it can prepare the next job. That introduces a bubble where the GPU isn't busy anymore; it's just sitting idle, at least if we're talking about a single application and there are no other applications in the system to pick up the slack.

That's the obvious thing to avoid. All the big graphics APIs allow you to extend this pipelining into your application by inserting markers at the point where you tell the driver, I want to know when this result from the GPU is available. Then you don't do the naive thing and wait for the result of the job you just submitted; you do the pipelining in the application and wait for the result of a job submitted earlier. So you're extending the pipelining, and that's a really good idea if you care about keeping your GPU busy.

That was the obvious cause of pipeline bubbles. A not so obvious cause of pipeline bubbles is shared data. In the example you see here, the CPU prepares a job and kicks it off for the GPU to execute. The GPU does this in parallel while the CPU is preparing the job after that, for the next frame or something like that. And then I decide to update some data that's also accessed by the previous job. As the driver can't tell where exactly the GPU is in that job, it basically has to wait for the GPU to finish working with this data before I can continue updating the data I'm going to use in the next job and kick that job off to the GPU. So that introduces another pipeline bubble, wasting GPU time.

All the previous slides talked about what happens at the level of a single application doing some work on the GPU and then trying to work with the results. What if you actually want to put this data on the screen of the device? Typically you'll have a compositor or something like that, which again will use the GPU. The way it works is that the application talks to the user mode driver via those acceleration APIs to draw stuff, and then there are other interfaces: if this is a Wayland compositor, this connection would be the Wayland protocol where the application talks to the compositor, and there are APIs on top of that, like EGL for the OpenGL world and the window system integration (WSI) for the Vulkan world. The application prepares a frame and kicks it over to the compositor, and the compositor might then again use the GPU to put this picture, the single application surface, onto the big screen and mix it with content from other applications. There's an obvious shortcut here if there's only a single application running full screen: the compositor could probably just take the application-provided buffer and put it directly on the screen.
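Before we move on to the display side, here is roughly what those markers look like at the API level. In OpenGL ES 3.0 they are sync objects; the frame bookkeeping and the render_frame() and read_back_results() functions below are hypothetical application code, so treat this as a hedged sketch rather than a recipe.

```c
#include <GLES3/gl3.h>
#include <stdint.h>

#define FRAMES_IN_FLIGHT 2

/* Hypothetical application hooks, not part of any real API. */
void render_frame(unsigned slot);
void read_back_results(unsigned slot);

static GLsync fences[FRAMES_IN_FLIGHT];

void submit_frame(unsigned frame_no)
{
    unsigned slot = frame_no % FRAMES_IN_FLIGHT;

    /* Instead of waiting for the job we just submitted (a glFinish here
     * would stall the pipeline), wait for the fence from FRAMES_IN_FLIGHT
     * frames ago -- by now it has most likely signaled already. */
    if (fences[slot]) {
        glClientWaitSync(fences[slot], GL_SYNC_FLUSH_COMMANDS_BIT, UINT64_MAX);
        glDeleteSync(fences[slot]);
        read_back_results(slot);  /* results of that older job are ready */
    }

    render_frame(slot);           /* queue this frame's GPU work */

    /* Marker: signals once everything queued so far has completed. */
    fences[slot] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}
```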
So you have another kernel level driver that's driving your display engine; you talk to it via KMS, kernel mode setting, and just kick the buffer off. But for now we'll assume that the compositor actually has to do some work and use the GPU to put the content from multiple applications together. That would look something like this, and I think you can see a pattern here: pipelining, pipelining, pipelining everywhere in the stack. While your display is busy showing one frame, your compositor is already preparing the GPU commands for the next frame and kicking those off to the GPU, to produce the frame to display in the next display cycle. Typically your displays run at a fixed rate like 60 hertz; if we're not talking about gaming displays that go much higher or can do variable rates, it's a fixed rate, and you always want to do display updates at that fixed interval. What you see here is the compositor preparing the next frame, but to do that it's working with data that was just produced on the application side, and what you don't see on the slide is that the application's CPU work preparing that frame would be further over to the left. So there's another level of pipelining involved there. Throughput, throughput, throughput.

All this pipelining is really the easy way to keep your hardware busy and performing at the level you actually expect it to perform, but it sacrifices a lot of latency for those throughput improvements. If you keep the previous compositor example in mind, you're basically scanning out to the display a frame that your application worked on two frames earlier. At a 60 hertz display you get 16.7 milliseconds per frame, so you're really talking about double-digit milliseconds of latency here; if your display runs at a lower rate, the latency is even longer.

There are ways to improve that, but again, it's not that easy to do if you want to keep the hardware busy. With quite clever scheduling you could make the application prepare the frame just in time: you kick off the prepare stage here, the GPU is working at that time, your compositor works in between, and you hit just the next frame, so you're basically down to a single frame of latency. But what you see here is that you get bubbles in the GPU pipeline again. Either you have a lot of GPU time to spare, or your scheduling gets really tricky, because if you want your GPU to be utilized all the time you really need to hit exactly the right times to schedule those prepare and GPU execution stages. And what can happen if you're a bit too clever with the scheduling is something like this: compared to the previous slide, only the GPU time of the application side changes. The application needs a bit more time on the GPU to produce the content to put on the screen, and because you're trying to hit just the right time to update the screen, you now go over the boundary where the display can still pick up a new frame. So you have to wait all the way until here before your newly composited content actually shows up on the screen. If you want to reduce latency in a smart way, it really gets hard.
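A rough way to see how brutal that deadline is: display updates are quantized to whole refresh periods, so overshooting by a millisecond costs you an entire extra frame. The numbers below are made-up 60 hertz examples, not measurements.

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double frame_ms = 1000.0 / 60.0;   /* ~16.7 ms per refresh   */
    const double work_ms[] = { 16.0, 17.0 }; /* just under / just over */

    for (int i = 0; i < 2; i++) {
        /* The result can only be shown at the next vblank after the work
         * is done, so presentation snaps to whole refresh periods. */
        double shown_after_ms = ceil(work_ms[i] / frame_ms) * frame_ms;
        printf("%.1f ms of work -> on screen after %.1f ms\n",
               work_ms[i], shown_after_ms);
    }
    return 0;
}
```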
And you see, that's only a very minimal increase in GPU time, but it has a huge impact on when your picture actually shows up on the screen. So the takeaway here is that GPU drivers are really tuned for throughput, sacrificing latency in the process. If you really need good latency, things get really hard really quickly. Latency reduction is possible, but it's hard to do. And with that, I think I've optimized a bit too well for latency myself and I'm at the end of my talk early, so I'm able to take questions. Thanks a lot.

How much of all this applies if you are just working with data, if you are sending data in for whatever you want to do and collecting it back with DMA or something, with no display involved, so you don't have these display slots?

It's basically the same thing. You just take the last bit of the presentation out of the picture, where you have the compositor and another layer of pipelining on top. If you're just working with data, it's the same: you prepare a chunk of data, have the GPU work on it, and if you then synchronously wait for the results of the GPU computation to come back before you work on the next chunk or prepare the next job, you're introducing the same pipeline bubble. It's basically this picture: the job doesn't have to be pixel data you put on a screen, it could just be some generic compute job where you're putting some data into the GPU. But if you're synchronously waiting for a result to come back before you can kick off the next job, you're introducing a pipeline bubble and the GPU sits idle waiting for the CPU to prepare the next commands. Where's the mic?

Is the graphics library, like Mesa, capable of doing some job reordering to better utilize the pipeline?

There are some tricks that GPU drivers do, yes. If you're looking at something like the shared-data case, the GPU driver will actually work very hard to hide that from you and, if possible, just put new memory under your API-level objects. So it tries very hard to avoid this, but in some cases the GPU driver just isn't able to do magic and hide the stall from you.

Is there any good, community friendly hardware platform we could use to play with all this? Like a GPU?

There are multiple open source platforms right now. The Arm Mali GPUs are really well supported upstream, and so are the Vivante GPUs. They're not that good at the API level; we didn't get around to implementing all the API-level stuff in etnaviv because of time and money, but there are open source drivers where you really can dig into that and play around with this stuff.

I think the Vivante GPU has a rather simple architecture, right? So it's maybe easier to access all this and develop with it.

Right. Many of the embedded GPUs are actually tiled renderers, where you do a lot more magic in hardware, again pipelining things and trying to do them in a more optimal way, and that makes it a bit harder to really dig into how things work, because you have another level of pipelining at the hardware level. The Vivante GPUs are pretty simple and really straightforward, so you can really see: OK, I'm doing this on the driver side and it translates to the GPU doing that thing. Any other questions?
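To make that "new memory under your API-level objects" trick concrete from the application side: the same renaming can be requested explicitly, for example by orphaning a buffer in OpenGL ES. This is a hedged sketch; the buffer handle and the data to upload are assumed to come from elsewhere in the application.

```c
#include <GLES3/gl3.h>
#include <string.h>

/* Update a vertex buffer that a previous GPU job may still be reading,
 * without stalling: re-specifying the data store with a NULL pointer
 * ("orphaning") lets the driver hand out fresh memory for the new data
 * while the GPU keeps using the old allocation. */
void update_vertices(GLuint vbo, const void *data, GLsizeiptr size)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    /* Orphan the storage the in-flight job may still be using. */
    glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_DYNAMIC_DRAW);

    /* Map the fresh storage and fill it; no GPU synchronization needed. */
    void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                                 GL_MAP_WRITE_BIT |
                                 GL_MAP_INVALIDATE_BUFFER_BIT);
    if (ptr) {
        memcpy(ptr, data, size);
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
}
```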
No? We have a few minutes left, so I have another bonus slide about something that might be interesting, something that can cause latencies you don't expect when you're working with a GPU driver stack: fences. They're basically a way to track work that we have committed to do but didn't get around to doing yet. If a user space driver prepares a new job for the GPU and kicks it off to the kernel, it might not execute immediately on the GPU, because the GPU might be busy with something else, but you get back a fence. When you wait on that fence, or see that it has signaled, you know the work is actually done. So those fences give you a guarantee of completion, and once they signal you can schedule the next part of the work. It's a way to track dependencies between all those jobs going on on the GPU: different applications, or different parts of your application, doing work in chunks that all depend on each other.

In the current Linux stack that's pretty well hidden from you, because those fences are attached to the data buffers that are shared between the jobs. They're called implicit fences, and the Linux kernel keeps track of them for you and inserts the jobs in the right order at the GPU level. So you might have a job here, say job three, where the CPU is waiting for its result, and if it depends on previous jobs, which might actually be bigger than your last job, then those fences, together with the CPU running ahead of the GPU, can cause large latencies at the point where you're waiting for a small job that depends on bigger jobs still running on the GPU, or on the GPU simply being busy for a while. And you'll see big latencies in the system. There's work underway to improve that, or at least to improve the visibility of it, with explicit fences in the upstream stacks, so the part of the application preparing the last job can see that there are still fences pending for previous jobs that haven't started or haven't finished yet. That's something to keep in mind, but that's really low-level detail. If you're interested in all the low-level details and in learning about them and working with them: we are hiring for our graphics team. Thanks everyone.