I'm currently working for Broadcom, working with the Raspberry Pi developers, but in the set-top box group within Broadcom. So I'm going to talk briefly at the start about the Raspberry Pi architecture that made it hard to get open sourcing to happen, the previous software architecture that was based on that, and then the architecture that I'm building instead, and some of the challenges we face doing so.

So the Raspberry Pi has a 700 MHz ARM CPU. This is ARMv6, which is a little funny: it has hard float, so groups like Debian have a hard time dealing with this architecture. So Raspberry Pi has forked Debian. They have their own Raspbian distribution, which they get to load a bunch of other software into, but it's mostly Debian recompiled for the particular ARM architecture here.

Attached to this ARM, there's the VPU. In previous generations, the VPU was the 3D engine. But on this one, the VPU actually has a fairly limited role. It does the hardware video encode and decode stuff. More importantly, though, when you boot the system, it's the chip that turns on first: it loads an image from the SD card, executes a pile of code to turn on a bunch of ARM clocks and things, loads the bootloader, hands off to the ARM, and gets the ARM actually running. So the VPU is, in a way, kind of in charge of things.

Attached to the VPU, there's now this separate QPU engine, which is the actual GLES2 3D part. A lot of people talk about the VPU as if it's doing the 3D, but there really is separate hardware doing the 3D engine now. It is just a GLES2 engine. It doesn't do anything but that; it's really just GLES2 and OpenVG. Also, because this is a mobile part, it's a tiled renderer. It seems like that's just the way everything gets built for mobile, pretty much. Keith is shaking his head. Vivante is not tiled. Oh, Vivante is not tiled. OK, I'd forgotten that.

So, the software architecture. Because the graphics had always been done on the VPU previously, when they built the new chip, the graphics was still done on the VPU. It's a custom vendor driver. Their OpenGL stack is something that was written in-house. It's closed source. It was. And that's the part that has the job of generating the shaders and the command stream to hand off to the QPU engine. So the bulk of the driver code is on the VPU, doing all of the work of being an OpenGL shader compiler. On the ARM side, what you had was this thin little library that would take the incoming GL command stream, put it in some buffers, and hand that command stream off to the VPU to get the actual work done. There was a 3-clause BSD code dump of that back in 2012, but the community didn't find it terribly useful, because the kinds of things you want to hack on as a driver developer are that shader compiler and all of the interesting stuff inside of there. Having the ability to hack on the emission of a command stream that's basically just a translation of OpenGL ended up not being terribly interesting to us.

So, how I got involved. I'd been an Intel graphics driver developer for eight years, I think it was, working on much larger desktop chips for the most part. But I eventually got to the point where I was really excited about fixing Android graphics, actually kind of specifically because my phone is super unstable. It's like, I've been working on Linux graphics for so long, and the Linux graphics I'm using constantly is crashing. How can I go fix this?
Most of the vendors are still fairly closed-source oriented. But back in February this year, Broadcom released a giant pile of code. They released a spec for the chipset, and I took a look at it, and it's like, this looks like a really good spec. It seems to cover pretty much all the stuff; it documents the weird constraints fairly nicely. And they also shipped a 3-clause BSD pile of source code. What happened was that some of the other chips they build don't have the VPU part, so the closed-source driver developers working on supporting those chips had to port all that VPU code over to the ARM to run their driver on those chips. The Raspberry Pi guys figured out that this had happened and said, wait a second, we could just ship an ARM-side driver for approximately our chip and get the open source code out the door. The downside, of course, given the history of this code, is that it's Android only. It doesn't support GLX or the EGL window system bindings, a bunch of the stuff that we sort of expect as Linux desktop users.

But I got really excited about the opportunity when, you know, I found a position online and applied to it, even though it wasn't really for what I wanted to do. And they said, oh, you want to do that? Can you come interview? So I joined back in June this year, and pretty much got to decide how I wanted to build the project; they sort of trusted me to make all of the technical decisions here. So yeah, I get to build a free software Mesa driver. It's MIT licensed, and it runs on the ARM. I'm also building a free software DRM kernel driver; my goal is to get this into the upstream kernel, and I'm going to talk a little bit later about some of the challenges we face with that. And I'm targeting X for the most part. What I'm building should be usable for Wayland right out of the box. For Android, you have to build the little Android integration piece that nobody has built yet, but it won't be that hard. But yeah, I'm using the xf86-video-modesetting driver that Keith and I have been working on over the last couple of years. I talked about it last year and Keith talked about it this year.

So one of the neat features of having so much support from my vendor is that within like two days of joining, they had handed me the simulator source code. We've got this little functional simulator that emulates the chip on an arbitrary CPU, and it has like four little entry points: allocate memory (and memset all the things), translate (generate a physical address for a chunk of the memory, or generate a virtual address for one of those physical addresses), and execute some code. So I took this library and built it into an i965 driver, which looks, as far as the libGL loader is concerned, like the i965 driver I used to have, except that when it takes all these buffers from the window system and needs to go do some graphics drawing to them, it instead copies the contents out of the buffers, runs the simulator on them, and copies the contents back into the window system buffers. So I can run glxgears on my laptop under simulation and see what the output would be on hardware without having to actually run the hardware, which is a little bit slow. What I get out of simulation, though, is really awesome stuff that graphics driver developers have always wanted. I can print out the registers from my shader as it's executing.
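As a concrete picture of those four entry points, here's a minimal sketch of what a wrapper around the simulator library might look like. All of the names and signatures here are hypothetical stand-ins; the actual Broadcom simulator interface isn't shown in this talk.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical wrapper over the four simulator entry points
     * described above. */
    struct vc4_simulator {
        /* Allocate (and memset) memory the simulated GPU can see. */
        void *(*alloc)(size_t size);
        /* Translate a CPU pointer to a simulated physical address. */
        uint32_t (*virt_to_phys)(void *vaddr);
        /* Translate a simulated physical address back to a CPU pointer. */
        void *(*phys_to_virt)(uint32_t paddr);
        /* Execute a command list at the given simulated address. */
        void (*execute)(uint32_t cl_paddr, uint32_t cl_size);
    };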
You never get access to look at your registers, because there's no input and output from the chip; there are no debugging facilities on our chips for doing this. To debug, you would do tricks like: at this point in my shader, I will store the contents of this value somewhere special, and then I'll paint that value as the color at the end of the shader and hope that nothing else stomped on it in between. It's all very custom and ad hoc, and it would take you a long time to do any single debug step. Now it's just: put printfs in the simulator and look at the register contents at every single step. I can GDB when the GPU crashes, which is pretty awesome. I can Valgrind to see when it's reading undefined texture memory. Of course, I can also sometimes forget to test on the real hardware and push regressions.

So I mentioned this is a GLES2 part. GLES2 is this kind of fork of desktop OpenGL as of about GL 2.0, 2.1, where they cut out a ton of features. A bunch of the features are things that were silly and nobody should do anyway, and we're glad they're gone. And a bunch of the features are things that you kind of miss. Like, why did we like those? Why did you take that away? But the result of having just a GLES2 part is you end up with like a 110-page hardware spec, which is like, you know, I'll just page through that in an afternoon. The first hardware I worked on eight years ago was 1700 pages. I didn't try to measure how many pages my current hardware spec from Intel was, because it was just too many files to even look at. So yeah, on my hardware there are like nine little state packets for OpenGL state, six state packets for draw calls, and eight state packets for the binner command stream you have to emit to decide which rectangle gets drawn in which order. Simple code. I actually got started with a piece of demo code from Scott Mansell that was 340 lines to draw a triangle. Within a couple of days I had this running on my DRM kernel driver, drawing triangles. I would have been really happy if I'd had a triangle in three months.

The downside to simple hardware is that you end up doing a lot of work in other places. Parts of OpenGL that would normally be in fixed-function hardware, with a bunch of documentation about how to set various bits to enable various bits of fixed-function hardware, I get to emit code from my compiler to do all of that work instead. So, vertex fetch format conversion: whether you're taking in 8-bit-per-channel normalized values or you're taking in floats, I have to figure out how to unpack the 8-bit normalized values correctly. All sorts of stuff: user clipping, shadow mapping, blending, logic ops, color masking, point sprites. I think I've found all of the things that I'm doing, but I'm not even sure; there are so many fallbacks happening where I'm emitting a bunch of code to deal with a feature because it's not fixed function. Of course, if you have plenty of shader time, plenty of cycles in your shaders to spend compared to your memory bandwidth, you don't really mind spending those cycles to get this work done.
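For instance, here's a minimal sketch of the vertex-fetch lowering just described, written as plain C for illustration; in the real driver this logic has to be emitted as shader instructions rather than run on the CPU.

    #include <stdint.h>

    /* Unpack one 8-bit unsigned normalized vertex attribute channel
     * into a float in [0.0, 1.0] -- the conversion that fixed-function
     * vertex fetch hardware would normally do for you. */
    static inline float unpack_unorm8(uint8_t v)
    {
        return (float)v * (1.0f / 255.0f);
    }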
But some things end up being harder. I mentioned before that I'm trying to build a GL driver, not just a GLES driver. I think this is going to be a really big feature for users. Porting desktop GL software to GLES2 is a bunch of work, and that's assuming you start with desktop GL 2 software and not desktop GL 1 software. We still have a lot of GL 1 software out there in the open source community, and porting that to GLES2 is kind of hard: you have to write a bunch of new shaders, and figure out how to get shaders working in the first place. So what we have is a bunch of support in the Gallium driver for turning desktop GL features into GLES2 features. When I'm handed quad primitives, which is four vertices per primitive, I'll make a little index buffer that says: draw this one, this one, this one, and then this one, this one, this one, as two triangles. OK, that's pretty easy. If they hand me 32-bit index buffers — well, GLES2 only does 8- and 16-bit index buffers. So right now, I just take the buffer they hand me, go pick out all the elements, chop off the top 16 bits, and hope. There's an assert in there, so at least I'll know when it happens. We expose occlusion queries with zero bits of precision. I have to do shadow map texturing in the shaders.

And there's a bunch of stuff I haven't done yet, where we have test cases showing that the driver's broken, but it's not that important quite yet. Polygon and line stippling: weird features that nobody uses. Polygon fill modes: I know software exists that uses this, mostly in the sort of CAD area, and maybe you're not doing that on your Raspberry Pi, I don't know. 3D textures: I have a really horrific plan for emulating these with a bunch of 2D textures. It's not gonna be any fun. Okay, yes, I will have fun, you're right. Derivatives: there might be enough hardware support to manage this; it's gonna be ugly. LOD clamping: I have no idea. None of it is that important, other than the giant piles of test cases that fail. So that's sort of the general feature-set problem I face.

This has also been an interesting little architecture to play with. Each of the instructions that I generate in my compiler has two separate operations, either of which can be a nop: there's one slot for add-like operations and one for multiply-like operations. So a min or max operation is an "add", and there are a couple of "mul" operations — it's mostly muls there, because you do a lot of multiplying in fragment shaders. Each of these operations has two arguments. The downside is that there's only one read address into each of the register files for your arguments, so there's a tiny little mux field that decides, for each argument, whether you're picking from the A or B file, and then a bigger field for where in the file you're reading from. And there are four accumulators. So you can get some pairing of operations, but your accumulators are kind of precious, because you want to use them for all the things. Also, there's no ability to spill registers, which is kind of painful; sometimes I just have to say, no, I won't draw that for you. Here's an example of one of the instructions, which I pulled out of glxgears, I think: in one cycle, we're going to do an add operation reading from r3 and writing to A0, and we're going to square B-file register 2 into r3. So I can stack up these instructions in pairs and get better throughput.
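As a mental model of that dual-issue instruction, here's a toy C struct; the field layout is invented for illustration and is not the real QPU encoding.

    #include <stdint.h>

    /* One add-pipe op and one mul-pipe op per instruction, either of
     * which may be a nop, plus a single read address per register
     * file; each argument's mux picks file A, file B, or an
     * accumulator.  (Toy model, not the hardware encoding.) */
    struct qpu_instr {
        enum { ADD_NOP, ADD_FADD, ADD_FMIN, ADD_FMAX } add_op;
        enum { MUL_NOP, MUL_FMUL } mul_op;
        uint8_t raddr_a;        /* the one A-file read address */
        uint8_t raddr_b;        /* the one B-file read address */
        uint8_t add_src_mux[2]; /* per-argument A/B/accumulator select */
        uint8_t mul_src_mux[2];
    };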
This is hard on your register allocator, though. For our register allocator, we use graph coloring in Mesa for the most part. Graph coloring is that thing where you generate a graph of all of the values you need to store and, based on their live intervals, you put edges between them indicating where you can't store two values in the same register at the same time. The problem is that once I choose a particular register for a value, that changes which set of other values it interferes with, because for an individual operation, if I've already got a particular A-file register for one argument, my other argument can't also be in the A file. The file restrictions change as you assign individual registers. So for now I have this gross hack: I'm reserving two of the registers, one from A and one from B, and when the register allocator picks something unfortunate, I spill into the file I need and use the value from there. It generates extra moves; it's pretty ugly. I do force some values into the A file, because some operations require the A file; I can say, yeah, that particular value, I will just always store in A.

Once I've register allocated, I generate a stream of single-operation instructions: if it's an add, it goes in the add slot and there's a nop in the mul slot, or there's a nop in the add slot and an actual mul instruction. Once I've got all of those individual QPU instructions generated, I have a little instruction scheduler that looks at them and says: OK, you've got a nop in the add slot here and a nop in the mul slot there, and you aren't both trying to use something from the same file with conflicting indices — piles and piles of code trying to pair these things up. It's fairly effective. I don't have metrics on it, but I think I cut instruction count by probably 20 or 30% overall over the course of adding all of the pairing support.
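The core legality check for that pairing might look something like this toy sketch; the types here are invented for illustration, not the driver's actual data structures.

    #include <stdbool.h>

    /* Where an operand comes from: an accumulator, or an address in
     * register file A or B.  (Invented layout.) */
    struct qpu_src {
        bool is_accum;
        bool file_a;        /* true: A file, false: B file */
        unsigned raddr;     /* address within the file */
    };

    /* Two operands can coexist in one paired instruction only if they
     * don't need two different read addresses in the same register
     * file, since each file has a single read address per instruction. */
    static bool srcs_can_pair(const struct qpu_src *x,
                              const struct qpu_src *y)
    {
        if (x->is_accum || y->is_accum)
            return true;
        if (x->file_a != y->file_a)
            return true;            /* one from A, one from B: fine */
        return x->raddr == y->raddr; /* same file only at same address */
    }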
But this isn't the only way I could do things. There are a bunch of improvements I can imagine to make register allocation better for this architecture. One would be to give the driver the opportunity to choose something better at the moment we're selecting a register. The register allocator right now, as it's choosing each individual register, has two heuristics available. One is: choose the lowest-numbered register that's available. This generally works out pretty well, because we keep our accumulators at low numbers, so you tend to use your accumulators more often. The alternative, which we added for i965, is round robin: choose a register that you haven't used recently, which improves instruction scheduling. What if instead I handed the driver "here are all the registers that are available — would you prefer this value be stored in any one of them in particular?", and the driver could say: oh, all of the operations using this value already have things assigned in A, so I'm going to choose a B register for this one if there's a B available. Alternatively, I could try something more global, doing a pre-pass deciding whether each individual value will live in A or B: generate a bunch of extra moves up front, do the assignment forcing things into A or B, and then notice that a bunch of these moves now have the same source and destination and cut them out at that point. That might work. The other option — this is kind of based on what they do in the code dump from Broadcom — is a bottom-up linear scan allocator: just generate instructions from the bottom to the top, pick the register for your destination at the moment you need it, and at the moment you need to generate your sources, look at the tree and decide where each source in particular should go. I haven't been putting a whole lot of work into register allocation, though, because if we switch to some of the new SSA architecture that's being worked on in Mesa, there are some really neat things you can do with register allocation directly out of SSA. I haven't looked into them enough yet, so I'm not putting too much attention on register allocation.

So, this new Mesa IR thing. Right now there are a bunch of stages that your GLSL code goes through internally in Mesa before it gets onto your GPU. GLSL IR is this tree-structured IR that we generate out of parsing the GLSL source code, and we have a bunch of optimizations at the GLSL IR level. That gets turned into TGSI, which is this serialized packet format that's really awful to do any optimization on; TGSI is mostly just a format for streaming shaders between things, and it's not very good for drivers. I turn that into a little QIR inside my driver, a tiny IR that I can do my own optimizations on, because TGSI has inserted a bunch of silly things for me. And out of the QIR, I do my register allocation and generate QPU code. My QIR is SSA, which is really easy to do when you don't have any control flow, and I don't have control flow because, well, it's a GLES2 part — I shouldn't need any, right? Oops. It turns out that our loop unroller doesn't manage to unroll all of the loops in the GLES2 conformance suite. And you look at the loops that aren't getting unrolled and it's like, yeah, I could unroll that; why doesn't it do that? I don't know. But I still think that doing control flow is probably the right answer: it would mean we can run a lot more desktop code, which doesn't come with the guarantee that all loops need to be unrollable. So I should probably do this. The hardware has support for it; it's just not used in the original Broadcom driver.

There's this new thing, though — NIR, "nir" is the pronunciation — that landed this morning in Mesa. I've been working a little bit with Jason Ekstrand at Intel, who's been doing a bunch of the work. NIR was actually originally started by Connor Abbott, a high school student who hung out at Intel for a while and wrote a new compiler architecture for us, which was pretty fun. So I've got this little branch in my tree that inserts a couple more steps in the compiler pipeline: it takes GLSL IR, turns it into TGSI; from that, I turn it into NIR, then turn it back into TGSI so I can feed it into my current driver architecture for turning TGSI into QIR into QPU code. This is probably more steps than I actually want. The obvious next thing would be to get rid of that extra TGSI step, because part of the whole point here is that I'm not a fan of TGSI. My goal with NIR is to be able to write optimizations for the compiler in shared code. All of us driver developers keep writing the same optimizations in our drivers, and it's not that fun the tenth time you've written a copy propagation pass; if I could just do that once in NIR, I would be really happy. So NIR directly into QIR is almost working. Maybe sometime in the future we'd drop the TGSI in the middle, because we have GLSL IR to NIR already for the i965 driver.
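To pin down that staging, here's a conceptual sketch of the two compile paths just described; every type and function name is a hypothetical stand-in for the Mesa internals, not a real entry point.

    #include <stdbool.h>

    /* Opaque stand-ins for the IRs involved (hypothetical names). */
    struct glsl_ir; struct tgsi_shader; struct nir_shader;
    struct qir_shader; struct qpu_code;

    struct tgsi_shader *glsl_ir_to_tgsi(struct glsl_ir *s);
    struct nir_shader  *tgsi_to_nir(struct tgsi_shader *s);
    struct tgsi_shader *nir_to_tgsi(struct nir_shader *s);
    struct qir_shader  *tgsi_to_qir(struct tgsi_shader *s);
    struct qpu_code    *qir_to_qpu(struct qir_shader *s);

    /* The current path, plus the experimental NIR round trip whose
     * extra TGSI step is the obvious thing to delete next. */
    struct qpu_code *compile(struct glsl_ir *s, bool use_nir)
    {
        struct tgsi_shader *t = glsl_ir_to_tgsi(s);
        if (use_nir)
            t = nir_to_tgsi(tgsi_to_nir(t));
        return qir_to_qpu(tgsi_to_qir(t));
    }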
What if I fed that NIR directly through the Gallium interfaces somehow into my driver? That would be pretty nice. I don't think anybody has any concrete plans for doing that yet, though.

Moving away from compilers: by far the most difficult part of working on this hardware is that there is no MMU on it. Everything else I've worked on since I can remember, which is a long time ago, has had an MMU under the GPU, so that the driver can build a bunch of page tables mapping the memory the user needs the GPU to access into the GPU. In our case, the GPU has direct access to system memory: the addresses that I generate in my command stream are physical memory addresses. (They would be DMA addresses, but I think they're the same here.) The kernel's CMA support helps this a lot. This would have been really hard to do a while ago, but these days I can just ask the kernel, hey, can you give me 256K of contiguous memory? And the kernel can actually say yes now. The downside is that that's allocating out of a pool of memory that's kind of half reserved for just CMA's own usages, which is unfortunate. I can't really allocate too much memory out of CMA, either: A, my system doesn't have a lot of memory up front, and B, the CMA pool is a fraction of that total system memory. It's also fairly slow to make those allocations.

But the worst part, even having done contiguous memory allocations, is that this is a giant security hole, because you can just ask the hardware to fetch from a texture that happens to be any part of kernel memory — anything the kernel is hanging on to. And since the hardware can obviously draw to memory, it can also store values into your page cache if you wanted. In the closed source driver stack, they just didn't deal with this. Basically, if you could manage to hand some shader code off to that stack, it would execute against arbitrary system memory. You could hand it bad vertex data, read too far, and read out of things you shouldn't be reading.

What I built for my solution, though, is in-kernel validation of the command stream. So I actually have to read through your shaders, figure out which uniforms are being used as texture sampler setup, and track those. I then look at your uniform stream and make sure that there are actual valid addresses for textures in there, and that you're not trying to set up a texture that's too big for the BO that's being referenced. I can cache the parsing of the shaders; I can't cache the parsing of the uniforms, because those change all the time. Your command stream also changes all the time. I have to look at all of your vertex reads and make sure that they're bounded correctly. I have to look at the tile buffer address and make sure that it's not going to draw to anything it shouldn't. Surprisingly, this only takes about 5% of the ARM CPU time — which, granted, is a lot; I don't have that much CPU time, and this kind of hurts. I wouldn't have been surprised if it was a lot worse, though.

However, this is the scariest code I've ever written: 1700 lines of just a long series of, if the user tries to do this, say no; if the user tries to do this, say no; if the user tries to do that, also say no. All of which is predicated on me correctly understanding how the hardware works — on me having a kind of perfect image of what this hardware does and how it responds to all the different ways you could set up your state packets. I have never had that understanding of any hardware, ever.
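A minimal sketch of the flavor of check the validator has to make everywhere follows; the names are hypothetical, and note the overflow-safe form of the comparison.

    #include <stdbool.h>
    #include <stdint.h>

    /* Does [offset, offset + size) fit inside a BO of bo_size bytes?
     * Written subtractively so that offset + size can't wrap around
     * 32 bits and sneak past the check. */
    static bool vc4_check_range(uint32_t bo_size,
                                uint32_t offset, uint32_t size)
    {
        return offset <= bo_size && size <= bo_size - offset;
    }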
A few other notes about my kernel code. I'm using the GEM CMA helpers that were generalized out of some other drivers, with a thin little wrapper around them to track where a buffer is in my GPU command stream: I have a sequence number, and a couple of list pointers for the BO cache. I have to have an internal BO cache because that binner step — the part that takes your incoming vertices, decides where they are on the screen, and builds up little per-screen-area command lists saying "draw this particular set of vertices for this tile" — needs kind of an arbitrary amount of system memory. You could construct some sort of function that figured out up front how much memory you needed, and that function would answer "way, way too much memory". So you do runtime allocation of memory in response to interrupts. And since I'm doing runtime allocation, and CMA is kind of slow to allocate out of, I need a kernel-side BO cache for this.

My driver interface, though, is really, really simple. I have three driver-specific ioctls currently. There's submit command list. There's wait for a particular sequence number to pass — this is the fencing idea that you see frequently. And there's wait for a particular BO to be idle, because sometimes you don't know what sequence number was last used for rendering to a BO: it came from somebody else, some other user passed an FD over a socket to you. So I have a wait-for-BO-to-be-idle. And then I'm just abusing the dumb allocation APIs to do my buffer allocation and mapping, and Dave is probably not a fan of this.
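To make the shape of that interface concrete, here's a sketch of what those three ioctls might look like; the struct layouts and names are illustrative assumptions, not the actual vc4 ABI.

    #include <linux/types.h>

    /* Submit a command list for validation and execution; the kernel
     * returns the fence-style sequence number the job was assigned. */
    struct drm_vc4_submit_cl {
        __u64 commands;     /* user pointer to the command list */
        __u32 size;
        __u32 pad;
        __u64 seqno_out;
    };

    /* Block until the given sequence number has passed. */
    struct drm_vc4_wait_seqno {
        __u64 seqno;
        __u64 timeout_ns;
    };

    /* Block until a BO is idle, for buffers whose last-use seqno you
     * don't know (e.g. an FD passed in from another process). */
    struct drm_vc4_wait_bo {
        __u32 handle;
        __u32 pad;
        __u64 timeout_ns;
    };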
I've talked about execution so far. All I've really done so far for KMS is abuse the firmware to set up a dumb frame buffer for me. The firmware has a little tiny chunk of memory that it uses for all of its graphics stuff; I'm not using the rest of its graphics functionality, but I can ask it to set up a scan-out buffer from somewhere in that area that it chooses. After it sets up the scan-out buffer, I just go and smash the hardware and say: no, no, no, that mode you set — great, thanks, but I'll scan out of my buffer instead. And this is all filthy hacks. It's assuming 8888, it's assuming untiled — lots of assumptions in this code — and I really need something better.

Building something better is going to be interesting. The notable feature of this hardware: most hardware has a couple of scan-out planes, right? You have the thing you see here, there's another plane for the cursor, and if you had a movie playing, you'd put that into a plane too, because scaling and color conversion is hard. The VC4 instead has this HVS display list. They came up with this fun idea: hey, we'll have a little microcontroller kind of thing whose job is to read your scan-out buffers for you, and it takes as input this chunk of memory that has a series of rectangles, formats, and addresses. For each line, it decides what pieces of memory it needs to read, goes and reads from them, and does the compositing itself. So my number of planes is not really limited by the hardware — other than that the display list lives in something like a 4K buffer, and if each little rectangle/format/address entry is 10 bytes or whatever, that's a few hundred of them. Instead I'm limited just by the memory bandwidth in the system: I need enough bandwidth that I can actually answer all of those reads in time for the display to get successfully scanned out. So we've kicked around a bunch of ideas for this, and when we started, the atomic mode setting work hadn't quite landed yet.

These days my best idea is to expose a giant number of KMS planes; the driver gets to try to present a scene with a ton of planes stacked up, and if the HVS won't be capable of that, it gets to just say, no, try again with something simpler. What can we do with this? With the Present extension, X now sometimes asks the driver: hey, could you present your scene at this vblank number with this plane contents? What if I extended that interface a little to be: hey driver, please present at this vblank interval this scene composed of this plane, with this plane on top of it, with this plane on top of that, with these formats. That would look a lot like what my hardware's interface is. We'll need some sort of communication here for when the kernel says, no, I can't do that for you. X could take a look at its scene and decide: well, I could probably save a bunch of bandwidth if I squashed this plane onto this plane; I'll fire off an OpenGL operation to do that rendering, then try a slightly simpler scene and see how the kernel feels about that. It would be pretty neat to build this sort of automatic put-your-stuff-in-planes implementation that saves a bunch of memory bandwidth for most people, without the driver having to know all the specifics of the weird, weird cases in which your particular kernel can scan out combinations of planes — basically by just having the driver squash things down until it gets to a single plane. The plan initially is to use GL for this, with the Glamor code that we have in place. The HVS actually has a neat little bit of magic here: I can configure a scene with the output being not the screen but a chunk of memory. So if I want to squash a couple of planes down, I could configure the hardware to read those planes exactly the way it would for scanning out to the screen, and it will just make a little progress on that rendering whenever there's a bit of bandwidth to spare; once it's rendered, I'd ask it to actually scan that rendered content out. That's gonna be a while — we'll start by trying to do it with GL first. The neat thing here is that we could avoid all of our X copy-areas-to-the-screen — things like Firefox scrolling and all of that — by just putting most of the contents in planes in the first place.
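Here's a conceptual sketch of that squash-until-it-fits loop; every function and type is hypothetical, just to pin down the control flow being proposed.

    /* Hypothetical scene/plane types and kernel-test entry points. */
    struct plane;
    struct scene {
        struct plane *planes[16];   /* bottom to top */
        int num_planes;
    };

    int  atomic_test(const struct scene *s);    /* 0 if the HVS can do it */
    void atomic_commit(const struct scene *s);
    void gl_composite(struct plane *dst, const struct plane *src);

    /* Keep squashing the top plane into the one below it with GL
     * until the kernel accepts the configuration. */
    void present_scene(struct scene *s)
    {
        while (atomic_test(s) != 0 && s->num_planes > 1) {
            gl_composite(s->planes[s->num_planes - 2],
                         s->planes[s->num_planes - 1]);
            s->num_planes--;
        }
        atomic_commit(s);
    }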
Upstream status. All of my Mesa code is upstream already; I started pushing that back in August, I think, and I've been doing pure upstream development since then. The kernel is a little harder, and not because of kernel upstream. Raspberry Pi maintains vendor kernel trees — Raspbian has 3.16 currently. These trees take upstream and rebase a bunch of drivers onto it; you squash some fixes in to deal with the rebase, or just because you've had some fixes since then that you want to roll into the driver. I'm based off of one of those trees so that it will boot on my Raspberry Pi. I'm using 3.15 at the moment; the particular version isn't that important to me. I could rebase, I just haven't. I've got a couple of hacks to core DRM to share some extra code, and then another 59 commits to build up the kernel driver. I would obviously like to stop working in my own tree — I'm not a fan of working in private trees; I would like to just be in upstream. But the upstream tree has a bunch of stuff missing that is in these vendor kernel trees. I just saw a patch to get the USB networking working; I was actually unable to boot upstream, because my configuration requires networking to even boot — I use NFS. So the USB support that gets the Ethernet running might be about to land, but there are some other features that I really need. The mailbox is this little interface to the VPU: it's the way you pass messages between them for things like, hey, VPU, could you set up my frame buffer for me? Or: hey, VPU, could you turn on all the clocks for the QPU? Because I don't know how to do that myself. So I rely on that driver in my kernel driver at the moment, and it isn't upstream. There's a bunch of other stuff missing from upstream too: CPU clock control, sound, a bunch of miscellaneous drivers. So for me, mostly I need a bootable upstream — the USB bits — and I need the mailbox driver in order to port my stuff to upstream.

But before I try to push upstream, I probably ought to fix some stuff in my ABI. The dumb create and map APIs — dumb meaning no acceleration, no smart hardware-specific stuff, just enough to get a frame buffer working when you don't know about the specific hardware — I'm using those in my driver, and as far as I know, Dave would get very, very angry if I actually shipped code using them. And there are a bunch of details of my command list setup where I just chose one particular implementation. Is it the best implementation I could do? I should probably do some profiling and a couple of experiments before I actually finalize this ABI.

And the worst thing: I need review. My kernel driver notably has those 1700 lines of absolutely horrible shader validation code. Who's gonna read through that? And if somebody does, who's going to have the attention to detail to notice that in this one place over here, you forgot to overflow-check your add — or even worse, what about the interaction of this packet with that packet? What if you put them in the opposite order and try to use some undefined packet contents? My best idea for dealing with the shader validation stuff, and getting something tested and believable out of it, is probably using Dave Jones's Trinity fuzzer, to at least verify that, in this giant mess of code, you can't actually do buffer overruns or stomp on system memory. Trinity fuzzing this thing will generate a bunch of GPU hangs, because you'll set up a bunch of incorrect GPU state if you just fuzz your GPU command stream. But if I can run it for a couple of days and not trash my disks, that would give me some amount of confidence.

So, the driver I've built is fairly simple: under 15,000 lines of 3D driver code — coming from a project where, I think, we were hitting about 100K lines of code in the 3D driver for Intel — and under 5,000 lines of kernel code. Granted, I need to write a bunch more, because I need to actually do native display. But it's fairly small. And despite being fairly small, in my simulator I'm at a 98.7% pass rate on the GLES2 conformance tests. The failures are mostly the loop problems I mentioned before — there are like four of those — plus a bunch of EGLImage stuff where the code base I have just doesn't have it hooked up for my window system. I'm doing pretty well on the Piglit GPU tests, the mostly-desktop OpenGL tests that we've built in the open source community: 92% on those, which I'm feeling really good about. Actually, one of the problems there is that a bunch of my math is too inaccurate for those tests.
The worst problem, though, is the hacked-up KMS code I wrote. It works on my monitor. It's forced to 1680x1050, which was the monitor I had at the time. It works for me; it probably won't work for you. You'll note that I'm not displaying on the conference projectors off of this hardware.

Links to the code: I've got a sort of to-do list and build instructions up on the dri.freedesktop.org wiki. Hey, the spec does actually exist — you can find it. And the sample implementation: the link here to the sample implementation from Broadcom is the Broadcom code dump that was ported to the Raspberry Pi by Simon Hall for Raspberry Pi's Quake 3 bounty. He actually managed to get the Broadcom dump up and running on the Pi, enough to run Quake 3. I think that concludes the presentation, if we have questions.

Hi, two questions. Number one: is it possible to turn off the security checking to get back that 5%, if you were in a performance-bound use of the Pi?

So, can we turn off the security to get the performance back? It's kind of tricky because — how do I explain this? One of the problems in generating GPU command streams is that user space doesn't know the particular addresses of your buffers. So you pass in these relocation structures that say: hey, in this part of my command stream, there's actually a reference to this buffer, and you need to pick its address out and, you know, OR it in. Because I'm doing all of this validation of my command stream anyway to get the security, that's where all my relocation processing happens. So I don't have a way right now to implement my relocations without parsing all of your command stream. You can imagine hacking up that code to only do most of the validation in the validating case. But I've also been thinking about rewriting my command stream format entirely to separate these two parts out — to have the relocations in one spot, and all of the uses of the relocations in something that looks more like the native command stream. It's gonna be tricky, though.

The other question: your slides mentioned X. Are you looking at Wayland at all as well?

Yeah, so I've really only been looking at X. Wayland should be easy: I've got all of the GL infrastructure necessary for Wayland, and I've got the PRIME buffer import/export so that I can do DRI3. All of that's in place, other than bugs. The more interesting thing for Wayland is going to be building the compositor. There is a port of Wayland to the Pi that uses all of this HVS infrastructure I talked about, by shipping these scenes off to the VPU and saying: hey, VPU, could you set up my scene to scan out all of these various Wayland surfaces in their various locations? We would need to port that idea to the new atomic mode setting APIs.

Right, so another question on the command stream validator. A lot of it sounded pretty hardware-specific. I know there are a few other chips out there which are also lacking hardware memory management. Do you think there's any benefit later down the track in someone trying to create something a bit more generic, with a library of different functions for different chips?

Yeah — I was talking with Jamey Sharp about this a while ago. I'm not sure, because, yeah, it is so hardware-specific.
You know, the checks are all about the particular interactions of this packet with that packet, and needing to have one of these set up before that one, and picking these bits out of this one and multiplying them in various ways, checking for overflows, and all of that. And, you know, 1700 lines of code is a lot to write and it's really ugly, but I have a hard time believing we could come up with a generic infrastructure that was simpler. And to a significant extent, you want something simple and trustable for this.

Last question: you mentioned that originally it was the VPU doing the shader compiling, but now that's on the ARM, along with validation. I guess the VPU is still kind of a proprietary bit, so you couldn't put it to work to do at least the shader compilation, to offload stuff off that ARM?

Yeah, there are no interfaces, as far as I know, in the current VPU firmware for "here's a pile of VPU code, could you render some things for me?" There are obvious problems here: I believe there's no open VPU toolchain. There was a guy who was working on hacking up an LLVM backend for the VPU, and I think he got to the point of emitting some instructions. So A, we don't know how to generate code for it, and B, there are no interfaces for it. For all of my stuff, I'm still using the existing Raspberry Pi firmware blob. It still has the Raspberry Pi driver in it; you have to set a flag to say, hey, firmware blob, stop listening to the graphics interrupts — those are mine now, please don't mess with them. Someday, once we get to the point of doing this in Raspbian, we'll probably get to put out a new firmware blob that's a lot smaller. And at the point where you're putting out a smaller firmware blob, well, could you do something else in place of what's currently on the VPU? There's more potential there once we get this giant pile of code off of the firmware.

Eric Anholt, thank you very much.