All right, yeah, thanks all for joining, nice full room. So my name's Daniel, I'm the graphics lead at Collabora, we're an open source consultancy. I've been working on graphics sort of incrementally and by accident, and definitely when I started, it was a weird, mystifying thing that took me quite a while to get my head around. So, in contrast to some of the earlier talks, I wanted to break down a lot of the components you see in the modern graphics stack, show what some of those flows look like, and give you a bit of a toolkit to understand and reason about all of these things.

In our line of work we have a lot of people working particularly on single-board computers and SoM devices, and they end up really pushing the capabilities of those kinds of devices. It used to be a huge area of revenue for us, because the old graphics stack was so hopelessly inefficient that we made a great amount of money from "I want to ship a product, but I need to pay someone to spend six months to a year just hacking everything to shreds in a completely non-repeatable way". So a lot of the work we've done over the past 10 to 12 years has, weirdly, been to try and get less revenue for ourselves: that whole idea that if you make a problem as boring as possible and commoditize it, you can go on to do other things, and not just be doing the same stuff for 10 years on end. I'm not sure we've made it the most clear and easy to follow in the process, but we're always trying to clear it up and make sure it's pretty accessible. Hopefully that's what we can get out of today.

So here's a typical graphics stack you'd see in a consumer device, something like a digital signage box or a mobile phone, since that's where a lot of our effort and attention goes. Any questions? We'll get to that.

If we take a more incremental look at it, the way I usually like to think of things, rather than top and bottom, is nearest to the eye and furthest from the eye. Conceptually, the display subsystem is closest to your eye; then you've got the window manager and window system as that nice bit of intermediary middleware; and then you've got the clients, the things doing the 3D rendering and media and so on, right at the far end from the hardware. Except that your window system also uses the GPU, so that gets complicated, and maybe it's encoding media to stream to Twitch, because that's what the kids like. And maybe your window system client is also a window system. But it's all quite generic and repeatable, so you can get pretty deep into some weird nesting and circular dependency games if you really feel like it. And if that's your passion, come work on the graphics stack, because we're always in need of people.

So, starting with the display side of things: DRM is the Direct Rendering Manager subsystem in the Linux kernel, and it covers both GPU and display. DRM is the overarching driver framework and set of drivers that we have. Kernel mode setting (KMS) is the part of it that's specific to the display pipeline, and it's just the final step of turning pixels into light: KMS isn't involved at any earlier point, and has nothing to do with 3D rendering or where your windows are positioned on screen or anything like that. It's just turning memory bits into electrons, and, you know, how hard could that possibly be, right?
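To make that concrete, here's a minimal sketch of opening a DRM device with libdrm and asking what it can do. The /dev/dri/card0 path is an assumption for illustration; real code should enumerate the available devices:

```c
/* A minimal sketch: open a DRM device and ask what it can do.
 * "/dev/dri/card0" is an assumption for illustration; real code
 * should enumerate the available devices. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <xf86drm.h>

int main(void)
{
    int fd = open("/dev/dri/card0", O_RDWR | O_CLOEXEC);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    drmVersionPtr ver = drmGetVersion(fd);
    if (ver) {
        printf("driver: %s\n", ver->name);
        drmFreeVersion(ver);
    }

    /* Can this device allocate dumb (CPU-rendered) buffers? */
    uint64_t has_dumb = 0;
    drmGetCap(fd, DRM_CAP_DUMB_BUFFER, &has_dumb);
    printf("dumb buffers: %s\n", has_dumb ? "yes" : "no");

    close(fd);
    return 0;
}
```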
Some people will tell you that fbdev can be used to do display as well, and historically it has been. Those people are wrong; don't listen to them.

Going a step deeper, we get to the window system, using Wayland as the example everyone knows and loves. Fundamentally, all the window system does is multiplexing: you have one set of display hardware and one set of input devices, you have multiple clients, and it needs to interpret and route between them. Wayland is just a protocol for doing window systems; it's not a concrete implementation. The only implementation that's part of Wayland in and of itself is some IPC and lifetime management, very basic stuff. So your desktop — I'm running GNOME — has its own complete implementation of the Wayland protocol; KDE has another, in fact it has two, for various reasons; and on embedded systems you'll be running a different Wayland server altogether, like Weston for example. So if you ever think there's a bug in Wayland because it just crashed, look at your desktop and blame someone else — not me, ideally. The main visible function is that it collects all the client windows together: it stacks them, so some bits are on top of others, positions them, and gets them out to the display. Similarly, it's the one deciding where keyboard events should go right now; the mouse just moved, what does that mean, who should I deliver it to, and so on. Again, there's a vicious rumor that there was something called X11. It's completely untrue, especially in 2022. More seriously, Wayland does have an X compatibility layer, which is just the full X server with everything routed sideways into the Wayland server. And that was really hard to implement — just don't even try to do X11 if you can avoid it.

Then go another layer down and you're looking at the GPU layer. We usually talk about that in terms of GL and Vulkan, and I guess one of the lesser-understood parts of this is that in and of themselves, OpenGL, GLES (which is sort of the embedded subset) and Vulkan do accelerated 3D rendering and nothing else. You give them a wireframe of all the geometry in your scene, some texture images you want painted onto it, and some shader programs that do cool effects and tell them how to paint it. But they only do rendering and nothing else. The bridge to actually getting things on screen is EGL for GL, and the WSI (window system integration) for Vulkan. They glue together the fact that GL and Vulkan are the things which can do this shiny, cool accelerated rendering, but ideally you want to show it at some point.

There's also another library called GBM, which you'll see kicking about. That's kind of a weird special case for when we need to display things via KMS: the way EGL and KMS work, you can't quite do that directly. So we have this library called GBM, which is a bit of a side channel into EGL, so we can punch through some of the abstractions and control much more directly where the buffers are and how we present them. Some people will tell you it's called the Generic Buffer Manager. It's not; that acronym definitely doesn't stand for anything. It was never called the Generic Buffer Manager at any stage.
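As a rough sketch of what that side channel looks like in practice, here's a compositor creating an EGLDisplay on top of a GBM device; error cleanup is omitted for brevity, and it assumes drm_fd is an already-open DRM device:

```c
/* A sketch of the GBM side channel: create an EGLDisplay on top of a
 * DRM device, so a compositor controls its own buffer allocation.
 * Assumes drm_fd is an already-open DRM device; cleanup omitted. */
#include <gbm.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>

EGLDisplay egl_display_from_drm_fd(int drm_fd)
{
    struct gbm_device *gbm = gbm_create_device(drm_fd);
    if (!gbm)
        return EGL_NO_DISPLAY;

    /* EGL_PLATFORM_GBM_KHR tells EGL that our "native display" is a
     * GBM device rather than a Wayland or X11 connection. */
    PFNEGLGETPLATFORMDISPLAYEXTPROC get_platform_display =
        (PFNEGLGETPLATFORMDISPLAYEXTPROC)
            eglGetProcAddress("eglGetPlatformDisplayEXT");
    if (!get_platform_display)
        return EGL_NO_DISPLAY;

    EGLDisplay dpy = get_platform_display(EGL_PLATFORM_GBM_KHR, gbm, NULL);
    if (dpy == EGL_NO_DISPLAY || !eglInitialize(dpy, NULL, NULL))
        return EGL_NO_DISPLAY;

    return dpy;
}
```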
So that's pretty much the top to bottom. Your clients render the content and pass it over to the compositor: here's a handle, so you can access the memory that backs it. The compositor does something to get that on screen, in some kind of position, with some kind of blending or whatever. And then KMS is the one that shovels it out through HDMI.

So there's a really clear division, not only of responsibility — well, that slide's quite light, really — but also of design principles. At the low level you have APIs like DRM, KMS, GLES and EGL. Not only are they hardware-specific in their implementation, they're very directed: they do exactly what you tell them to, and nothing else. If you tell them to do the wrong thing, that's on you. As you climb up, you get much more hardware-independent and abstract — for example, Mutter, Weston and all of the Wayland compositors have no hardware-specific code in them; they run everywhere — but they're also the first layer at which you start to apply policy and the various arbitrary choices that people think make a good desktop. So that's the first real layer at which we translate from a high-level declaration of intention down to a very specific low-level realization of that intention. And by the time you get way up into client land, it's "I'd like a button" and "download some stuff off the internet", and magic things happen up there.

So I wanted to walk a bit through each of those three layers: how they're actually put together, and how you go about working with them, be that implementing them, trying to debug them, or just trying to understand what's going on. Again starting nearest the eye, let's look at KMS — and DRM as well, I suppose. It's a sort of classic kernel model: each device has its own DRM character device, and this is your /dev/dri/cardN, whatever. On a classic desktop system like Intel or AMD, you'll have one device which covers your GPU and your displays and everything that uses, like, two kilowatts. On Arm systems — and classically RISC-V, MIPS, other architectures are available — you tend to have a separate GPU device and a separate display device. They're not only separate devices, they're made by separate vendors and just glued together by the SoC manufacturer, which will become important later.

So again, going from nearest to furthest, here's the way DRM devices are put together internally, which represents the hardware surprisingly well. It's a design which has stood up really well for about the last 15 years — apart from the one DRM object I don't mention here, because we just try to pretend that never existed. Connectors are your actual displays: a DRM connector will be this USB-C output, or an HDMI output, or your integrated panel. That's always one-to-one. CRTCs — which is old enough that it stands for "CRT controller", for anyone else with grays in their beard — generate a pixel stream for the displays. It's their responsibility to combine all the different inputs for that one display, scale them, apply anything that's needed, to get you one flat image which is going to go out over the wire. Within a CRTC, though, it's not just a single image. Display controllers can, to a limited extent, do their own composition, where planes let you position, stack, scale and color-convert images within the CRTC, so you can have multiple inputs combined together. And frame buffers are just the stuff in memory which arrived, somehow, as a fully formed set of pixels.
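Here's a minimal sketch of what that object enumeration looks like with libdrm — walking the connectors on an already-open DRM fd and printing the preferred mode of anything connected:

```c
/* A minimal sketch of KMS object enumeration: walk the connectors on an
 * already-open DRM fd and print the preferred mode of anything connected. */
#include <stdio.h>
#include <xf86drmMode.h>

void dump_connectors(int fd)
{
    drmModeRes *res = drmModeGetResources(fd);
    if (!res)
        return;

    for (int i = 0; i < res->count_connectors; i++) {
        drmModeConnector *conn = drmModeGetConnector(fd, res->connectors[i]);
        if (!conn)
            continue;

        if (conn->connection == DRM_MODE_CONNECTED && conn->count_modes > 0) {
            /* The kernel sorts modes with the preferred one first. */
            printf("connector %u: %dx%d@%u\n", conn->connector_id,
                   conn->modes[0].hdisplay, conn->modes[0].vdisplay,
                   conn->modes[0].vrefresh);
        }
        drmModeFreeConnector(conn);
    }
    drmModeFreeResources(res);
}
```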
Conceptually, then: you attach a frame buffer to a plane; the plane might do some transformation, like scaling or a color transform; and then the CRTC logically combines the planes together to give you that final image. The most important thing — again, that slide's quite light — is that the CRTC generates timing. Fundamentally, if you have a 60Hz display, the CRTC is the thing that ticks at 60Hz. Within DRM and KMS, that's where everything flows back from: we start from the timing the CRTC has, we know that's our deadline, we know that's our cadence, and everything works back from there. So that's always our starting point when we're looking at things like A/V sync or performance. You start with that CRTC timing, because it's a fixed physical clock, and you make sure you've got that feedback element going backwards, so everything properly flows from there and everyone knows when their deadline is.

All of these objects you enumerate; they have properties, and the properties have values. It's basically what you'd expect. Atomic mode setting, which is what we've had as the interface for about seven years, is just a pile of property updates for the whole system: this plane should be at this offset, it should be scaled by this much, displaying this frame buffer; this connector is enabled, this one isn't; and so on and so forth.

One of the more useful things atomic gives you is a test-only interface. Display hardware is infuriatingly limited. It can do all of these wonderful things which save you power and save you bandwidth, but it has the world's most exotic set of constraints on when it can and can't do them. We had a session a few years ago where we all sat down and tried to enumerate all of the different constraints we could think of, so we could expose those as an API and let user space figure out, "all right, this is what I can and can't do, so this is the configuration I need to put together". We got a bit over an hour in and then just fundamentally gave up. So the best approach, which I'll cover in a sec, is essentially brute-force checking — and the checks are quick enough that every single frame, user space can put together a configuration and throw it at DRM: how do you like this? No. How do you like this? No. And we can do that 50 times or more per frame.

The frame-by-frame thing is quite important, because DRM isn't a sort of producer-consumer model. In the media world, you tend to have these pipelines: connect thing A to thing B, timing magically flows around, frames are distributed evenly. DRM does what you tell it to, until you tell it to do something else. So every frame, if you have any new update, you need to tell DRM, "hey, here's your new configuration" — possibly 50 times, throwing stuff against the wall until it sticks.
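Here's roughly what that test-and-commit dance looks like with libdrm's atomic API; the property IDs here are placeholders you'd have looked up from the plane's property list beforehand:

```c
/* A sketch of the atomic test-and-commit dance: build a pile of property
 * updates, ask the kernel "how do you like this?", and only commit for
 * real once it passes. The property IDs (prop_fb_id, prop_crtc_id) are
 * placeholders you'd have looked up from the plane's property list. */
#include <stdint.h>
#include <xf86drmMode.h>

int try_flip(int fd, uint32_t plane_id,
             uint32_t prop_fb_id, uint32_t fb_id,
             uint32_t prop_crtc_id, uint32_t crtc_id)
{
    drmModeAtomicReq *req = drmModeAtomicAlloc();
    drmModeAtomicAddProperty(req, plane_id, prop_fb_id, fb_id);
    drmModeAtomicAddProperty(req, plane_id, prop_crtc_id, crtc_id);

    /* TEST_ONLY: full validation in the kernel, no hardware touched. */
    int ret = drmModeAtomicCommit(fd, req, DRM_MODE_ATOMIC_TEST_ONLY, NULL);
    if (ret == 0) {
        /* It fits: commit for real, non-blocking, with a completion event. */
        ret = drmModeAtomicCommit(fd, req,
                                  DRM_MODE_ATOMIC_NONBLOCK |
                                  DRM_MODE_PAGE_FLIP_EVENT, NULL);
    }
    drmModeAtomicFree(req);
    return ret;
}
```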
And again, it's working backwards when you start working with a DRM device — there are some code samples I'll link to in a couple of slides' time. You start with your connector: once you've enumerated everything, you've looked at it and said, "HDMI is connected, it says it can do 1080p", and then you work backwards. So you start with HDMI-A-1, and within the object enumeration there are compatibility and routing constraints. The CRTCs will say, "I can work with these connectors", and you pick the one that works with HDMI-A-1. Then the planes say, "I can work with these CRTCs". So, working backwards, you pull out this tree of objects, derived from the output you want to light up.

And in order to display some pixels, you need to be able to source some pixels. GEM is called the Graphics Execution Manager. It has nothing to do with execution at all; it's just a memory manager — and a memory manager with no API, apart from destroying a buffer object, which is kinda cool. Every time you see "BO", that's a GEM buffer object, and all it is is a collection of bytes. There's no type, there's no metadata. It could be firmware, it could be geometry, it could be pixels — who knows. The reason the API only covers the concept of having a buffer and destroying it is that the hardware is super, super divergent: think of unified memory versus dedicated video RAM, these kinds of things. So we don't have a generic API at the kernel level which lets you allocate a buffer to use — and you'd have a good point if you said that doesn't sound great to use.

We do have one limited carve-out. If you're starting with your first KMS program, what you want to be dealing with is dumb buffers. They're really specifically targeted: you're writing a splash screen, or you're writing a software compositor which just does CPU rendering and nothing accelerated. That does give you a generic ioctl which lets you allocate a buffer object with a certain width and a certain height; you can mmap it, and you can wrap it in a KMS frame buffer, which just describes the properties of the buffer object — and that's it. It can be tempting to use dumb buffers for other things, because that's the only generic allocation API KMS has, but don't: it probably won't work for you, and when you try to make it work, you won't get a very good reception when you ask how to make it work, and everyone will be disappointed.

The runtime loop is quite simple, because again, KMS is just doing exactly what it's told. You've already enumerated all of your objects, you know how the routing looks, and that's quite consistent. So all you're doing is making sure you've got your KMS frame buffers; you attach them to a plane, maybe cropping in the process; you position them within a CRTC, maybe scaling them in the process; you stack them relative to each other; you commit everything — hopefully you've got it right and it works, because you've been sure to do your test commits — you wait until the kernel finishes, and then you go again. KMS is a limited enough API in scope that it's pretty easy to approach. It's some tedious typing, but it's not conceptually difficult.

The place I usually recommend people start is the thing that I wrote myself — there might be some correlation. kms-quads is an example out there which tries to be quite self-documenting in terms of how you approach actually using KMS from the user space point of view, how you work with it, and to be a relatively extensible jumping-off point for your own work on top of that. There are a couple of other useful tools: kmscube is more focused — as you can probably tell from the Mesa part of the URL — on the 3D side of things; drm_info is a really nice debugging and introspection tool; and there's actually some pretty good documentation in the kernel repo.
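To tie the dumb-buffer bits together before we move on, here's a minimal sketch of allocate, map and wrap — the software-rendering case only, with most error handling omitted:

```c
/* Tying the dumb-buffer bits together: allocate a CPU-writable buffer,
 * mmap it, and wrap it in a KMS framebuffer. Software-rendering case
 * only; most error handling omitted. */
#include <stdint.h>
#include <sys/mman.h>
#include <drm_fourcc.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

uint32_t create_dumb_fb(int fd, uint32_t width, uint32_t height, void **map_out)
{
    struct drm_mode_create_dumb create = {
        .width = width, .height = height, .bpp = 32,
    };
    if (drmIoctl(fd, DRM_IOCTL_MODE_CREATE_DUMB, &create) < 0)
        return 0;

    /* Ask the kernel for an offset we can mmap through the DRM fd. */
    struct drm_mode_map_dumb map = { .handle = create.handle };
    drmIoctl(fd, DRM_IOCTL_MODE_MAP_DUMB, &map);
    *map_out = mmap(NULL, create.size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, map.offset);

    /* The framebuffer is where format/stride metadata lives; the BO
     * itself is just bytes. */
    uint32_t fb_id = 0;
    uint32_t handles[4] = { create.handle };
    uint32_t pitches[4] = { create.pitch };
    uint32_t offsets[4] = { 0 };
    drmModeAddFB2(fd, width, height, DRM_FORMAT_XRGB8888,
                  handles, pitches, offsets, &fb_id, 0);
    return fb_id;
}
```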
So then, on to using Wayland. There's been a lot of material on how to use Wayland, so I'm going to keep this one fairly light. Also, please use a toolkit if you can, if you're just writing an application — they already do this really well. The most important thing about Wayland is that it has that multiplexing flow between the multiple clients and the single set of output devices, as well as input devices like your keyboard, mouse and so on. But the most important difference is that it's descriptive, not prescriptive. As a concrete example: when you're working with X11, X clients say to the server, "here's a new window, it's exactly this big, put it on top of everything else, give me all of the keyboard and all of the mouse input until I tell you otherwise". In Wayland, clients say, "here's a surface, this is a pop-up, please display it". So in Wayland, we have a lot more scope within the compositor to implement good semantics. The whole thing is based on the client furnishing the compositor with a ton of context, a ton of information, and a lot of description, which lets the compositor implement something good — rather than X11, where we ended up painted into a corner, because the capability of the whole system is constrained by the least creative client, and that effectively ends up as your API.

The other thing is that we're deliberately very light on the compositor, because it's such a critical system resource. The entire protocol is really focused on making sure the compositor doesn't have to do loads of work, and that you shouldn't see any blocking or jank. This didn't stop people from putting blocking D-Bus calls to Google Calendar in there, which would block your entire system for a few seconds — but they've fixed it now, so I won't name them.

Across the compositor landscape, if you're wanting to put together a distribution or a product or something like that, the rough guide is that Weston is designed for all the non-desktop use cases: anything where you're not dragging loads of stuff around and it's somewhat more of a static configuration. It's super efficient on the hardware side, and quite stable and predictable. The other end of the spectrum is Mutter. If you're running GNOME, this is your Wayland server — it's mine as well. It looks really good, it's quite shiny, you can extend it with JavaScript. It doesn't scale down super well, but it works well for desktop use cases. And then wlroots is kind of in the middle: it's a bit of a toolkit, which is less efficient than Weston for sure, but it's also more approachable and more extensible. As a client, though, you don't really have to care, because all of the core protocols are implemented the same way by all of us. Coming from the X11 model, a Wayland compositor is a combination of the server, the window manager and the compositor all in one piece, handling all of that policy as well as the realization.

So, a quick run-through. Everything in Wayland is an extension. Clients connect, and the first thing they do is ask for the list of extensions and start binding to them — that's what you'll always see on startup of a Wayland client: listening to the extension registry and binding to those extension interfaces. As for the main ones: a wl_buffer is exactly what you think it is — "some pixels are over here". A wl_surface is a window: really a container for the wl_buffers the client is pushing, and a receiver for the input events the compositor is pushing the other way. But in and of itself, that's all it is.
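Here's roughly what that startup dance looks like with libwayland-client — just binding wl_compositor and creating a surface; a real client binds several more globals:

```c
/* A sketch of the startup dance every Wayland client does: connect,
 * listen to the registry, bind the globals you need. Only wl_compositor
 * is bound here; real clients bind several more. */
#include <string.h>
#include <wayland-client.h>

static struct wl_compositor *compositor;

static void handle_global(void *data, struct wl_registry *registry,
                          uint32_t name, const char *interface,
                          uint32_t version)
{
    /* Every global the server advertises arrives as one of these events. */
    if (strcmp(interface, wl_compositor_interface.name) == 0)
        compositor = wl_registry_bind(registry, name,
                                      &wl_compositor_interface, 1);
}

static void handle_global_remove(void *data, struct wl_registry *registry,
                                 uint32_t name)
{
}

static const struct wl_registry_listener registry_listener = {
    .global = handle_global,
    .global_remove = handle_global_remove,
};

int main(void)
{
    struct wl_display *display = wl_display_connect(NULL);
    struct wl_registry *registry = wl_display_get_registry(display);
    wl_registry_add_listener(registry, &registry_listener, NULL);
    wl_display_roundtrip(display); /* wait for the initial burst of globals */

    struct wl_surface *surface = wl_compositor_create_surface(compositor);
    /* ... wrap it in xdg_surface/xdg_toplevel, attach buffers, etc. ... */
    (void)surface;
    return 0;
}
```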
There's a lot of layered extensibility in Wayland. Your wl_surface — say you start up your browser — will be the surface itself; then it's wrapped in what we call an xdg_surface, which is sort of opting into desktop window management; and then an xdg_toplevel, which goes even further to say, "this isn't a pop-up, this isn't a dialog box, this is a full application window — so talk to me about title bars and right-click menus and all this kind of thing". And all the input flows from wl_seat, which is the extension interface for input. One thing we tried to do is make it as easy as possible to spin up and create new extensions. Some would say maybe too easy, but if you want to do your own private thing for a pretty focused use case, you can go and do that. You don't need to go to some central place and get the entire ecosystem to support it before you can do anything.

If this diagram looks familiar — well done on paying attention, because I did copy and paste it from the KMS one. It's exactly the same thing. There's been no fundamental change: you're attaching buffers to the container, and then they're getting combined by the compositor. And again, you have that timing flowing backwards: the compositor transmits it all the way from the display back to your client, to let you know how the pacing's going and when is a good time for you to be painting, so you can keep up. That goes all the way through, so we never have any fundamental collision where we need to resolve different ideas about timing, because it always flows backwards from the display.

And the loop is about as simple as you'd think. You create a surface; you annotate it with some metadata; the server tells you, "I think you should be this size"; you allocate some buffers (through the generic allocation ioctls that we don't have — but that's the next bit); you draw some stuff; you tell the compositor to display it; then you wait around until the compositor sends you an event saying, "now's a good time to paint your next one", and that's what you do — there's a sketch of this loop below. The initial setup can be a bit verbose and a bit noisy because of all the extensions, but in the steady state, it's that exact same workflow as KMS. There's nothing more complicated you need to do as a Wayland client.

And there are a ton of resources out there, like the Wayland Book and things like that. Do please use a toolkit if you're writing an application, because they've done it and they've bottomed out all of the corner cases. But if you do end up digging into things, or if you have something different — say a media player where you need to be interacting with Wayland directly, which for example GStreamer does — there are a few tools. wayland-info is a good introspection tool, just to tell you all the compositor's capabilities and give you an idea of what it actually supports. wlhax is maybe the most useful tool that no one knows about: it's an intercepting proxy that sits between your client and the server, dumps out all the requests, and is quite smart at keeping track of objects, so you can go and trace what your surface state is and what the last events were. It's good for trying to figure out corner-case bugs. Weston's also got its own tools, which just try to surface as much information and logging as possible. And the usual starting place for "my first client ever" is the simple clients in Weston, which are completely self-contained examples of how you'd go about doing it, before you get to all the complexity you end up with in real apps.
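To make that steady-state loop concrete, here's a sketch using frame callbacks; draw_next_frame() is a hypothetical placeholder for your own rendering and buffer management:

```c
/* A sketch of the steady-state client loop: draw, commit, then wait for
 * the compositor's frame callback before drawing again. draw_next_frame()
 * is a hypothetical placeholder for your own rendering and allocation. */
#include <stdint.h>
#include <wayland-client.h>

extern struct wl_buffer *draw_next_frame(void); /* hypothetical */

static void frame_done(void *data, struct wl_callback *cb, uint32_t time_ms);
static const struct wl_callback_listener frame_listener = {
    .done = frame_done,
};

static void submit_frame(struct wl_surface *surface)
{
    wl_surface_attach(surface, draw_next_frame(), 0, 0);
    wl_surface_damage(surface, 0, 0, INT32_MAX, INT32_MAX);

    /* Ask the compositor to tell us when it's a good time to paint again. */
    struct wl_callback *cb = wl_surface_frame(surface);
    wl_callback_add_listener(cb, &frame_listener, surface);

    wl_surface_commit(surface);
}

static void frame_done(void *data, struct wl_callback *cb, uint32_t time_ms)
{
    wl_callback_destroy(cb);
    /* Timing flows backwards from the display: this event is the
     * compositor's pacing hint, so only now do we draw the next frame. */
    submit_frame(data);
}
```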
So, quickly through the GPU side. GL is just that pure rendering API: you're pushing in the vertex data, you're pushing in the textures and the shaders. GPUs are really, really good at doing this because they're insanely parallel. All of your vertex shaders — which you might run to deform your geometry, change the presentation, rotate things, whatever — and all of your fragment shaders — which run to fill in all the pixels and do your cool lighting effects — are completely parallel, asynchronous and independent. And you have to program very, very differently for them. In the CPU world, you're thinking about doing things once, making sure you share as much as possible, so you're not duplicating cycles and you're not exhausting memory. GPUs want you to do the opposite, really: any synchronization point is where things start falling apart. They want to free-run across the entire workload without any interaction between themselves or with the outside world at all — ideally just straight-line ALU work. And being powerful goes with being power-hungry: even though they're really great and super capable, a lot of our display work is focused on trying to avoid them, because we can get real power savings by avoiding the GPU completely and using the more fixed-function display pipeline where we can.

To break it down a little: originally, in the SGI days, you had a single output, and your clients would say, "I'd like to draw here on this output". They would bang on the video card directly, and that rendering would just turn up immediately — which is obviously not what we want today. So we've had this sort of slow crawl. GLX was the thing that introduced the idea of, "hey, what if we have multiple clients, and a server that has ideas about where those clients should be and what they should do". GLX just forwarded all of the GL commands to the X server and said, "okay, you do it and figure out what that should mean" — obviously not super efficient. So DRI — which you'll see all through Mesa, EGL, any of these projects, as an acronym — was the thing that first got clients direct access to the GPU without conflicting with each other. And DRI2 was the evolution of that, which gave them their own memory areas, like we have in a quite modern system. They're used everywhere, but they're just shorthand for "accelerated rendering" and "modern accelerated rendering"; they're pretty meaningless as terms now.

And yes, as I said, EGL is just that abstraction, like GLX was: it's the glue between GL and the outside world. It's the place where clients get this concept of frames that are timed and paced and synced to the refresh rate, ideally. It's not great, because it doesn't have any kind of event infrastructure and it really tries to keep you from knowing what's going on — which is where GBM comes in, to give us that side channel, and why as a compositor we need to use GBM. Again, it's that sort of symbiotic relationship between them, and that separation of concerns that we have between them. Vulkan is the same: the WSI is to Vulkan what EGL is to GL. Vulkan is a better API if that trade-off works for you — it might not — but all of the concepts are the same, and completely applicable to both.

And then dma-buf is how we get to share buffers between our different subsystems and our different processes. It's just being able to export a BO out to a file descriptor, so you can pass it around: you can import it into Video4Linux if you want to encode it later on, for example. It's the common interop. Every subsystem in the kernel, every high-level or middleware API, has support and has the bindings for handling dma-bufs, to exchange buffers everywhere.
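Here's a minimal sketch of the export side with libdrm — turning a GEM handle into a dma-buf file descriptor that anything else can import:

```c
/* A minimal sketch of the export side: turn a GEM handle into a dma-buf
 * file descriptor that any other subsystem or process can import. */
#include <stdint.h>
#include <xf86drm.h>

int export_bo(int drm_fd, uint32_t gem_handle)
{
    int dmabuf_fd = -1;

    /* The fd refers to the same memory: it can be sent over a Unix
     * socket to another process, or imported into EGL, Vulkan or V4L2. */
    if (drmPrimeHandleToFD(drm_fd, gem_handle,
                           DRM_CLOEXEC | DRM_RDWR, &dmabuf_fd) < 0)
        return -1;

    return dmabuf_fd;
}
```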
So when you put it all together, it's roughly as described. A quick recap: the client connects and creates an EGL window surface; the server gives the client some hints about how it should really be allocating; GL renders; and then a dma-buf is what transits over to the Wayland server. The server does exactly the opposite: it gets the dma-buf and imports it into the GPU in its own context — or into KMS, if it wants to use the display directly — and figures out how to realize it once it's got it over the interchange barrier. Then it either uses GLES to composite the whole scene, or, ideally, avoids that where it can.

I think I'm running out of time, apparently, and I've largely covered this, but: it's really hard to tell in advance whether we can have that static and useful display controller pipeline. So at every point in the stack, we've got to keep on transporting these fairly dynamic hints — "maybe you should reallocate in a different format", or "try applying scaling first", something like that — in order to actually get that nice feedback loop. For KMS, again, that's just a brute-force thing, and that's what we've put a ton of work into Weston to achieve; the others are catching up to us, as we're catching up to them in some other aspects as well.

And I think the main thing to take away about how you use GPUs is that you need to really target them and make sure you use them for the right things. If you have a really short, super-bursty workload, it can be completely dominated by the setup time and the overhead. You don't want to be branching and serializing. Memory can kill you as well, if you're copying into special areas. So you really need to think very carefully about how you use them.

And with that, I think we are, in fact, over time. So I'll have to leave it there. Thanks very much for all your time, and enjoy.