And so our next speaker will be Sebastian, talking about, I guess, 3D in Genode.

Okay, hi. So I'm Sebastian. I'm working at Genode Labs, and I have been doing this for almost 10 years now. In the summer, two years ago, we got a little project funded that let us look into 3D acceleration, with a view towards GPU virtualization. And since we are Genode Labs, we decided to have a deeper look at it and try to get something working on our microkernel, component-based system. Hence the very complicated title; maybe I should have called it "3D acceleration on microkernels".

So, a short outline, as usual: a short motivation, then I try to explain which parts of a GPU need to be multiplexed and shared by multiple clients, how we do that in user space, in Genode specifically, and then I show the results. So it's going to be a pretty straightforward talk. Come in, please.

So, as Tim Sweeney says, during the last 20 years or so GPUs have become more and more important and way bigger. They used to be very small and very slow, but today it may even look like this. This would be an Intel Skylake quad-core with integrated Intel graphics. If you just look at it, you see: that's the GPU, and that's your quad-core processor. So maybe these things are kind of important, and since your browser and everything else now uses this device, multiple applications will use it, so you probably should take care of it.

Another thing is that GPUs have become really programmable. In the past, you would say: okay, draw these polygons, draw these vertices, and so on. Now you have something called shader languages. You write this code, the driver compiles it at runtime into GPU machine code, and it can produce something like this. This was taken from Shadertoy, and it is executed by a browser application: you write your shader, it gets compiled, and there it is. And this would be some disassembly of the generated code; it looks kind of x86-ish, with vector instructions and that kind of thing. So one idea we had: maybe we should treat GPUs a little like we treat CPUs, or virtual CPUs, or something like this.

Okay, can you hear me? Should I try to talk a little louder? I have a very low voice, so that's hard. Okay, so that was the motivation.

What we did next was to look at a modern graphics stack on Linux. What does it look like? Some people might be familiar with it. On top you have your 3D application, and then you need some sort of 3D library, let's say OpenGL; Mesa supports OpenGL. And Mesa consists of one or more layers, depending on what you want to do. On the left side you see EGL, that's the integration into the windowing system. There you say: okay, OpenGL, use this window, use this surface, here's your frame buffer, what's the bit depth, and all these kinds of things. Then you have the actual OpenGL layer the application programs against, and a back end based on the Direct Rendering Infrastructure. Here you have, for example, an Intel back end and a software rendering back end. And of course the Intel back end will eventually talk to the kernel driver and render the graphics. But the first thing we did was look at this whole thing, to evaluate its complexity.
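To make the shader point concrete, here is a minimal, illustrative sketch of how an application hands GLSL source to the driver through the standard OpenGL API, which then compiles it to GPU machine code at runtime. This is not code from the talk, just standard API usage; it assumes an OpenGL context has already been created, for example via EGL.

    #define GL_GLEXT_PROTOTYPES   /* expose GL 2.0+ prototypes with Mesa's headers */
    #include <GL/gl.h>
    #include <GL/glext.h>

    /* a trivial GLSL fragment shader, like you would paste into Shadertoy */
    static char const *fragment_src =
        "#version 130\n"
        "out vec4 color;\n"
        "void main() { color = vec4(1.0, 0.5, 0.2, 1.0); }\n";

    GLuint compile_fragment_shader()
    {
        GLuint shader = glCreateShader(GL_FRAGMENT_SHADER);
        glShaderSource(shader, 1, &fragment_src, nullptr);

        /* at this call the driver (e.g. Mesa's Intel back end) translates
           the GLSL source into GPU machine code */
        glCompileShader(shader);

        GLint ok = 0;
        glGetShaderiv(shader, GL_COMPILE_STATUS, &ok);
        return ok ? shader : 0;
    }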
So we looked at it and just tried to identify the parts which are important if you want to support an Intel graphics core. You can see the whole Mesa stuff roughly consumes 300K lines of code, including a shader compiler, so it's highly complex. And you don't really know what they do, how they translate the shader language and into which instructions on the GPU, because this stuff is not documented that well. And then, of course, the part we will talk about later, because that's the main point: the kernel driver, which actually schedules the GPU code onto the GPU.

But first, we tried to get this Mesa thing working on our Genode system, just as a proof of concept. We said: how hard could this be? And we don't want to use the Intel stuff yet; we want to use the software renderer first. So we ported it over to Genode, and how this usually goes is: you see a lot of pictures like this. Then you work further and you see things like that. And eventually, you see your demo, the standard gears demo, in software rendering.

After this, we decided: okay, let's follow the actual GPU path, which requires enabling the Intel back end of the Direct Rendering Infrastructure. And then you have this library called the Direct Rendering Manager, and this library directly talks to the kernel; that's the kernel interface, and it sends jobs to the GPU. Basically, what this library does is tell the kernel driver: give me some memory from the GPU, give me some of what they call buffer objects, where you can store your image data and vertex data. The driver fills them with compiled code, with image data, with vertex data, sends them to the kernel, and the kernel will execute them.

So the big question arises: how does the kernel GPU driver actually multiplex this stuff? Because what you want is, if you have multiple clients in user land, you want to separate them so they cannot cross-talk or anything. We looked at the kernel driver and we saw a lot of outdated code, like for older Intel graphics devices, and features we didn't really want to support. And this is why the 100,000 lines of code is not a fair comparison, because we just wanted to target the newer GPU generations and only take advantage of the new features. So the question arose: is this feasible, and how complex is it?

Next I'm going to talk about the things you actually want to multiplex and the features we found that we want to use. That's going to be a little more technical, but you don't really have to understand all of it; I will just give you an idea. So this is how Linux does the multiplexing. Here are two clients, and Linux multiplexes through the DRM library. You have a DRM context in the kernel for each client, and the kernel uses this context to identify the client and keep the clients separated. So let's see here. Okay, so in the GPU you have one or more engines, as they call them. The most important one is the render engine; this is the engine where you send your 3D operations and your general-purpose GPU processing stuff. The next important one is the Blitter engine; this is meant for very fast copying of pixel data from one buffer to another. And then you have some nice-to-have things like video decoding support. Since we concentrated on rendering 3D, we just looked at the render engine in our scenario. Okay, so now it really gets more technical.
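To picture the DRM interaction described a moment ago, here is a rough sketch (not the talk's code) of the upstream i915 ioctl flow: allocate a buffer object, fill it, and ask the kernel to execute it on the render engine. The mapping, copying, and relocation steps are omitted, and the include paths assume libdrm's headers are on the include path.

    #include <cstdint>
    #include <cstddef>
    #include <xf86drm.h>     /* drmIoctl(), from libdrm */
    #include <i915_drm.h>    /* i915 ioctl structures */

    /* illustrative only: allocate a buffer object and submit it as a batch */
    int submit_batch(int drm_fd, std::size_t batch_len)
    {
        /* "give me some memory from the GPU": create a buffer object */
        struct drm_i915_gem_create create = { };
        create.size = 4096;
        if (drmIoctl(drm_fd, DRM_IOCTL_I915_GEM_CREATE, &create))
            return -1;

        /* ... map the object and copy the compiled batch commands into it ... */

        /* "execute this buffer": submit it to the render engine */
        struct drm_i915_gem_exec_object2 obj = { };
        obj.handle = create.handle;

        struct drm_i915_gem_execbuffer2 execbuf = { };
        execbuf.buffers_ptr  = (std::uintptr_t)&obj;
        execbuf.buffer_count = 1;
        execbuf.batch_len    = (std::uint32_t)batch_len;
        execbuf.flags        = I915_EXEC_RENDER;

        return drmIoctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
    }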
So basically, here you have your GPU, there you have the render engine, and the only way for the CPU to actually access the graphics memory is through a so-called aperture. The aperture is an MMIO window, so it sits in your memory, and the CPU has to map graphics memory through the aperture. This is done, and I will come to this, through the global GTT, the global graphics translation table. Because of course, like real CPUs, GPUs have page tables. There are also ones called per-process page tables, and we will come to those next. So you can see the render engine can actually access both kinds of page tables, but mapping of graphics memory for the CPU only goes through the aperture.

Okay, here we go. Every graphics card has this global graphics translation table; there is just one, and it's shared by all GPU clients. So if you want to map something through there, through the aperture, it has to be handled by the GPU driver in the kernel, which has to make sure that there's no cross-talking or whatever. On the other hand, each client can have a per-process page table, and this one is hierarchical, like we know it from x86 CPUs, and it can be associated with a client. That's the important thing. The exact format doesn't really matter for us now. So what you actually have to share is the aperture, because it's limited: if you want to map your image through the aperture, you have to make sure each client only sees its own things in the page table.

Then there is an almost minor thing called fencing. The GPU doesn't store image data linearly in memory; it tiles it and stores it non-linearly. You can tile and de-tile through the fence registers by saying something like: okay, this region is X-tiled, and the GPU will do the rest for you. These registers are also limited; there are only 32 of them, which multiple clients somehow have to share, either by partitioning them or by dynamically assigning them to clients. So these are the most important parts you actually have to share.

The other thing is that the render engine has some sort of command buffer. It's a ring, and you write GPU commands into it. Let's say a batch buffer is a buffer containing GPU commands, and you say: execute this buffer. And in the past, all the clients shared this hardware ring. This has changed too. In the past you had this hardware ring, you would write your commands in there, and there was a head and a tail pointer, and you would say: okay GPU, here's the new tail pointer, and the GPU would execute from the head until it reached the tail pointer. Now, in newer GPUs, they have this notion of exec lists, and maybe the best comparison, if you're familiar with VT-x, is the VMCS area where the whole machine state is stored; this is kind of comparable. So per client you have this exec list, and in there is a page-table pointer; there's the ring with the commands the client wants to execute; there's a hardware context where, if you preempt the context, the GPU stores the whole state; and there's a hardware status page where the GPU can say: okay, what's the state of the GPU. You can have this per client, and the only thing you do now is: all right, let's write our commands to the logical ring, set the page table, and then say to the GPU: okay, execute it. The GPU will switch the page table, copy the logical ring onto the real hardware execution buffer, and you're basically done.
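As a rough mental model of that per-client exec-list state, here is a conceptual sketch. The member names, ring size, and the omitted context-descriptor and register programming are ours for illustration, not Intel's.

    #include <cstdint>
    #include <cstddef>

    enum { RING_DWORDS = 4096 / 4 };   /* one page of command ring, as an example */

    /* conceptual per-client execlist state (names are illustrative, not Intel's) */
    struct Client_context
    {
        std::uint64_t ppgtt_base;          /* pointer to the client's private page table */
        std::uint32_t ring[RING_DWORDS];   /* logical ring holding the client's commands */
        std::uint32_t tail { 0 };
        /* plus: a hardware-context save area and a hardware status page per client */
    };

    /* copy a client's commands into its logical ring and advance the tail */
    void queue_commands(Client_context &ctx, std::uint32_t const *cmds, std::size_t num)
    {
        for (std::size_t i = 0; i < num; i++)
            ctx.ring[(ctx.tail + i) % RING_DWORDS] = cmds[i];
        ctx.tail = (ctx.tail + num) % RING_DWORDS;

        /*
         * The multiplexer would now build a context descriptor referencing this
         * state and write it to the engine's execlist submit port; the GPU then
         * switches to ctx.ppgtt_base and executes the logical ring's contents.
         */
    }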
So you still have to make sure that there's an interrupt or something in there, because what a client still can do is say: okay, I have a shader with an endless loop, and you cannot do anything about that. The only thing you can do is have a watchdog timer that says: if the GPU doesn't come back, because of some malicious attack, you have to reset it. That's the only thing you can do at the moment.

Okay, so now, this is kind of what I just explained, and how we did it on Genode. On Genode we have this notion of sessions. You have your GPU multiplexer in user space, so not in the kernel, and you have this session interface where a client can say: okay, I want to have a GPU session. The multiplexer will allocate all the necessary resources for you, and you can submit your command buffers like you would do with DRM, just in user land and through this kind of session. So here you have N clients, and this is the information every client has to be provided with: as I said, you have to slice the fences and the aperture, and there's a page table per client and a hardware context.

Okay, so this is how it fits into the Mesa picture we had before. We still have our 3D application. We say this whole Mesa thing is, for us, way too complex and way too large, so we link it into the client application, and we exchange the DRM library back end so it speaks our GPU-session protocol and establishes a GPU session to the GPU component. So you basically have a user-level component containing the driver, which cannot crash the kernel at all.

Okay, so we started the porting again and writing the multiplexer, and this time you actually could have seen something: the gears were there for a second, then they were gone. It turned out we didn't do the synchronization well, so the rendering wasn't done and we had already painted the picture; most of the time you would get portions of it, or nothing. This is what tiling looks like: if you don't de-tile anything, you always get these nice lines, so then you know you did something wrong. The other thing that's really hard when writing something like this: I mean, they've got pretty good documentation, like 10 volumes or something, but you cannot do it without the Linux driver, because there are so many little details missing, like when you have to flush which cache, and there are quirks for some devices. I'm almost done. So this took us a long time to actually achieve.

In the end we ended up with a demo scenario like this. This would be a Genode component system, a very small one. You've got your root task here; it launches our GPU multiplexer slash 3D renderer. This is not really necessary, just for demonstration. Then we have our frame buffer, it must come from somewhere, right now our real hardware frame buffer, and our nitpicker GUI multiplexer. In the launcher, this would be one 3D subsystem: we start a 3D application, and we have this nit_fb frame buffer, which connects to our GUI server and gets a frame-buffer session. The multiplexer, through the session, renders the picture off-screen for the 3D application, the application copies it to the nit_fb frame buffer, and nitpicker multiplexes it onto the screen. So you can have multiple applications on one screen.

Okay, the results, shortly. We are actually pretty proud that we were able to write this whole multiplexer in 10,000 lines of code. But this is just Broadwell; for Skylake and Kaby Lake you will have to do some more work.
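To give a feel for the session interface mentioned above, here is an illustrative sketch of what such a GPU session could look like as a C++ interface. This is not the actual Genode API; the names and methods are made up for illustration.

    #include <cstddef>

    namespace Gpu { struct Session; }

    /* illustrative GPU-session interface, one instance per client */
    struct Gpu::Session
    {
        using Handle = unsigned long;

        /* allocate a buffer object backed by graphics memory */
        virtual Handle alloc_buffer(std::size_t size) = 0;

        /* map a buffer through the client's slice of the aperture,
           optionally marking it as tiled via a fence register */
        virtual void *map_buffer(Handle handle, bool tiled) = 0;

        /* submit a batch buffer: the multiplexer queues it on the client's
           logical ring and binds the client's page table before execution */
        virtual void exec_buffer(Handle batch, std::size_t length) = 0;

        virtual ~Session() { }
    };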
For Skylake, for example, you have to load microcode again; it's the new trend at Intel, microcode for anything, even Wi-Fi. And the whole trusted computing base really gets smaller, because we are not able to add the Mesa library to the trusted computing base; it's too complex and way too large. So, consequently, we are now supporting Mesa on Broadwell GPUs. As a next step, we want to take advantage of the Blitter engine so we can copy pixel data very quickly; we do it in software right now. As I said, Skylake and Kaby Lake support would be nice, and maybe we will look into GPU virtualization, but we don't know yet, because Intel has something they call GVT-g, Intel Graphics Virtualization Technology or something, and there they have a device model internally, which is kind of interesting, so you don't need to implement your own graphics device model.

Okay, some references if you want to look it up online, and it's demo time, I think. Okay, alright. So this is a really old demo, you've seen it a lot; let's see what it does, some standard Mesa demo stuff. And to show that we can have two: there would be our gears, so these would be two separate clients at this time. And we did some more sophisticated things: we ported a shader from Shadertoy, like one to one; we'll see if it lives. So this is one, compiled: you have the shaders compiled into machine code, loaded onto the GPU, and executed. That would be three. So let's see if we can do four. Yes, just because you can. Alright, that's it.

Okay. The question is whether we have some priority handling. We haven't implemented that, but since you schedule contexts, you can implement something like this; it's not a problem, you can decide in the driver which context to schedule next. There is also some notion of preemption, but we haven't really explored that yet. The question was whether you can stop work while it's rendering.

We wrote a back end for EGL. So, did we modify the Mesa library, did we upstream changes? No, we didn't upstream our changes, because they are too old by now: we ported Mesa 11.2, and since then Mesa is at 17 or something; they changed the versioning. We just have a new EGL back end, and we changed the DRM back end, but DRM is not Mesa.

The question is whether we could communicate this to the Linux community. Yeah, maybe, maybe. But you know, the Linux community is so big, and Intel has, how many people, 150 in open source, so they are doing their things, and until now we were the small guys. So I don't know; we could try.

The question is whether the output of the graphics requires a copy to the frame buffer, copying the image to the frame buffer, and whether it's possible to use the GPU for that too. So he asked if it's possible to use the GPU for pixel copying, and yes, you use the Blitter engine, that's what it's there for, and we want to use that in the future, but it was out of scope for this work. And I mean, you can copy 3 gigabytes per second on a modern CPU, so it's sufficient.

More questions? Do we have a project behind this, or did we do this just to show that Genode can do it? So, do we have a project for this? Yes. We want to use Genode on a daily basis, we do use it on a daily basis, and we want to have 3D acceleration, so it's going to be integrated into the main Genode scenarios someday, hopefully pretty soon. And our thing is, we want to look into this graphics-card virtualization, and it would be cool to have a virtual Linux or whatever, which could take advantage of
3D somehow, and all the approaches up until now were not really good.

Yep. So, are we looking into other GPU cores? Yeah, we could look into AMD, there is documentation there; NVIDIA is of course a no-go, because, you know, Nouveau and all these things. But, I mean, this took too many years to do, so you don't do it every day, and you don't do it without funding; it's a little hard to do just for fun.

Yeah, if I can maybe briefly speak to the last point, because you had a question, and I actually work for AMD on our graphics drivers. The Mesa part, I think you can just take one to one, like you did for Intel, and that's the right way to go. For the kernel stuff, the multiplexer stuff, you really are biting off a lot, and the drivers are huge; there are lots of people working on those, so trying to reimplement them is a lot of work. If you can find any way of taking the code that's in the Linux kernel and maybe somehow adapting it, that would probably be the way to go. I know it's not the microkernel philosophy.

So the guy from AMD said it's hard to get the kernel part working; he said the user part is fine, and what we should do is try to port the kernel part. So guess what we did with the Intel driver first... no, we did it on this one, we had Craig wheel. Yeah, but yeah, that's good. Okay, I think that's it.