OK, it's 25 past, so we're going to start pretty soon. Get settled. I hope everyone had a good lunch, and that you still have some energy for some codec-related adventures. What I'll be talking about today is integrating hardware-accelerated video decoding with the Linux display stack. So who am I to tell you about this? I'm Paul Kocialkowski. I work at a company called Bootlin, and we provide all kinds of development services around embedded Linux in any form. You can contact us for development and consulting, but we also do training, especially on-site: if you have a team of engineers you need trained up on these topics, you can call us. We work with a strong open-source focus, so everything we do is released under free software licenses. This got me to contribute to a number of different projects, especially the Linux kernel. The one I'll mostly be talking about today is the Cedrus VPU driver, but I also worked on the DRM driver for Allwinner SoCs, which I'll mention a bit as well. Most recently, I developed a new training course on displaying and rendering graphics with Linux. It's something we released publicly, about 200 slides, under a Creative Commons license, so if you're interested in graphics you can check it out and share it with everyone. The office I work from is in Toulouse, in the south-west of France, so not so far from here.

So what's the actual purpose of this talk? A few things. First, I'd like to present the specific use case I had to deal with in terms of video decoding: some basics about it, how Linux deals with it, and what's specific to our hardware and our use case, which is about Allwinner SoCs. All of this is interesting, but the different components, the video parts and the display parts, actually need to communicate with each other, so I'll also be talking about video pipeline integration throughout this talk. And this talk is going to feel a bit like a rant, because not many things worked out. Part of the purpose is to show you where things went wrong and the issues I hit, especially around user space APIs, so that if you have to deal with the same situation, you know where the blockers are and what the current state of the whole thing is. So yes, it's going to feel like a rant, but hopefully it will come across as constructive criticism. And well, things could be worse; they definitely could be worse, so let's look on the bright side of things.

TL;DR: it's all about pipelines, going from video decoding to actually displaying frames. I mentioned video decoding; what is that exactly? Well, you probably know that pictures stored as raw arrays of pixels take a huge amount of data: huge sizes, huge files, if you need to represent a whole sequence of pictures as actual pixel arrays. So in order to make this more bearable, we compress these images, the series of images that make up video files. How does this compression happen? There are basically several layers of compression involved, and this is to give you some idea of what a video codec is actually made of in terms of reducing data size.
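To get a feel for the numbers, here is a back-of-the-envelope computation of what uncompressed video costs; the figures (1080p, 30 frames per second, 24-bit RGB) are my own illustrative assumptions, not from the talk:

```c
#include <stdio.h>

/* Back-of-the-envelope: raw (uncompressed) video data rate, assuming
 * 1080p at 30 frames per second with 3 bytes (24-bit RGB) per pixel. */
int main(void)
{
    const double width = 1920, height = 1080, fps = 30, bpp = 3;
    double bytes_per_frame = width * height * bpp;   /* ~6.2 MB */
    double bytes_per_second = bytes_per_frame * fps; /* ~187 MB/s */
    double gbits_per_second = bytes_per_second * 8 / 1e9;

    printf("%.1f MB/frame, %.0f MB/s, %.2f Gbit/s\n",
           bytes_per_frame / 1e6, bytes_per_second / 1e6,
           gbits_per_second);
    return 0;
}
```

That's around 1.5 Gbit/s for plain 1080p, which is why compression is not optional.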
One of the first things we can do is color space compression. The idea is that instead of representing colors in terms of red, green, and blue, we use a different color space called YUV. It's interesting because it separates the luminance component, which is essentially a grayscale representation of the image, from the color components. The human eye is more sensitive to luminosity than to color, so we can share a single color sample between multiple pixels while still keeping one luminosity value per pixel. The luminosity stays precise and the colors are less precise, but the human eye doesn't really notice. Thanks to that, we can have one color sample per block of four pixels, for instance, and it still looks just as good. This already gives us some reduction in how much data we need to represent pictures.

Then there is a second method, which is basically what JPEG is made of: spatial compression. We use a transform to frequency space, a bit like what exists for audio files, except it happens in 2D for images. In the case of codecs, it's usually a discrete cosine transform. Once we're in frequency space, we can remove the high frequencies of the image, which are also less noticeable to the eye. This is what determines the quality of the image: filter more high frequencies and you get slightly worse quality, but far fewer elements to represent. That's spatial compression.

With videos, we get something extra compared to, say, JPEG, which only has color and spatial compression: temporal compression. This means we represent pictures not only as fixed pixel arrays, but as diffs against the previous and following pictures. We're basically encoding what changed between pictures to generate new images. That's very effective, because in pretty much any video, the whole scene isn't changing at once. Usually one object moves in the stream; if I'm just moving my hand, the rest of the scene doesn't move, so you don't need to encode it. You just encode the diff of my hand moving, and you get a much more efficient representation in terms of data and space usage.

But that's not all. There is a final step called entropy compression, which is basically the same as what you find in zip or any other archive compression format: we assign shorter codes to the symbols that occur most frequently in the representation of the image, and longer codes to those that are less common. Thanks to that, we reduce the size once again. So what a video codec implements is basically these four layers of compression.
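To make the spatial compression step concrete, here is the naive 8x8 2D DCT-II that JPEG-style codecs build on; a textbook sketch for illustration (real codecs use fast, integer-only variants):

```c
#include <math.h>

/* Naive 8x8 2D DCT-II over one block of samples. After this transform,
 * high-frequency coefficients (large u, v) can be quantized away with
 * little visible impact, which is where the savings come from. */
void dct_8x8(const double in[8][8], double out[8][8])
{
    for (int u = 0; u < 8; u++) {
        for (int v = 0; v < 8; v++) {
            double sum = 0.0;

            for (int x = 0; x < 8; x++)
                for (int y = 0; y < 8; y++)
                    sum += in[x][y] *
                           cos((2 * x + 1) * u * M_PI / 16.0) *
                           cos((2 * y + 1) * v * M_PI / 16.0);

            double cu = (u == 0) ? 1.0 / sqrt(2.0) : 1.0;
            double cv = (v == 0) ? 1.0 / sqrt(2.0) : 1.0;

            out[u][v] = 0.25 * cu * cv * sum;
        }
    }
}
```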
Now, in order to properly decode the encoded pictures, we need some metadata that provides the details of how the stream was encoded, and thus how it's supposed to be decoded. That's a number of parameters you need to carry along with the compressed data itself. Once we group the metadata and compressed data together, we have something called a bitstream. And alongside the video, you usually have other things: if you have a video, it's nice to have audio as well, and you can also have subtitles. The grouping of all of that is what we call a container format. The container holds the video stream, which is a bitstream made of metadata and compressed data, but also possibly compressed audio and subtitles. So in the end we have this representation where a container wraps the encoded video and the rest. With these techniques, we get reasonably small files holding long durations of video that we can share around, and that's great.

So that's the video decoding part. Hardware video decoding is about having specialized hardware to offload this decoding operation, which is what I'm interested in today. At Bootlin we work on embedded systems, and those don't have a lot of CPU power to spare, so video decoding should be offloaded if possible. There are basically two major types of hardware implementations for video decoding. The first is called stateful: a microcontroller is attached to the actual decoding hardware, and it does all the metadata parsing and engine configuration. It also keeps track of the reference buffers needed for the temporal compression scheme, where diffs are calculated between pictures, so you need to know which images each diff applies to. With stateful decoders, the microcontroller handles all that; with stateless decoders, it's up to the CPU, which still has to configure registers to specify how the video was encoded so it can be properly decoded.

In Linux, this hardware is supported through the V4L2 framework. It's pretty new; lots of people have been working on it over the past year, actually a bit longer, but it really came together last year. Stateful codecs are supported thanks to the V4L2 M2M framework, and for the stateless ones we had to add a new API called the Request API. The Request API is basically a way to synchronize the compressed data buffers with the metadata that we provide directly to the kernel driver, so that it can configure the registers accordingly. Once we have all of this, we have nice V4L2 drivers we can use to decode videos, and we get decoded pictures out.

So what do we do with these pictures? We can read them from the CPU using the well-known mmap syscall and just access the data. But more interestingly, if we want to pipe them to another device, for instance the display engine, because we actually want to show the decoded video on a screen, we can use a mechanism called DMABUF. It provides a reference to the memory area holding the decoded picture that can be passed between different drivers, so we don't have to do an extra copy. In the end, the expected result of all this is to integrate a decoded video, like here, with a user interface in a desktop environment. That's the goal. But to achieve it, there are lots of things to take into consideration. I mentioned that pictures are not coded in RGB but in a YUV color space, so we usually need to convert from YUV to RGB before the video can be composited with the rest of what's on screen. That's one thing that's needed.
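To make that conversion step concrete, here is a minimal sketch in C, assuming full-range BT.601 coefficients (streams specify their own matrices and ranges, and real pipelines do this in the display engine or GPU rather than per-pixel on the CPU):

```c
#include <stdint.h>

static uint8_t clamp8(int v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

/* One-pixel YUV-to-RGB conversion, full-range BT.601, fixed-point
 * approximation of:
 *   R = Y + 1.402 * (V - 128)
 *   G = Y - 0.344 * (U - 128) - 0.714 * (V - 128)
 *   B = Y + 1.772 * (U - 128)
 * In a 4:2:0 layout like NV12, one (U, V) pair is shared by a 2x2
 * block of pixels, which is the chroma subsampling described earlier. */
void yuv_to_rgb(uint8_t y, uint8_t u, uint8_t v,
                uint8_t *r, uint8_t *g, uint8_t *b)
{
    int d = u - 128;
    int e = v - 128;

    *r = clamp8(y + ((359 * e) >> 8));
    *g = clamp8(y - ((88 * d + 183 * e) >> 8));
    *b = clamp8(y + ((454 * d) >> 8));
}
```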
We might also need to do scaling: for instance, if you want to display the video full-screen, you need some way to perform that scaling operation. These operations are quite computation-intensive, because you need to go over each pixel and apply an operation to it, so doing them on the CPU is not very efficient. Thankfully, we have hardware to help with that as well, and this is where the display engine comes in. The display engine is not only responsible for sending pixels down your HDMI link or whatever; it also has scaling, color space conversion, and composition abilities, so we should be able to use that. There's also the GPU, which can do pretty much anything: it's totally able to take our decoded data, do all the composition and scaling we need, and produce a final image we can show on the actual monitor. And if we want to do this efficiently, what we really have to avoid is buffer copies. If we're going to use different hardware blocks for different operations on our decoded buffers, we need to make sure we don't copy the data from one device to another; the blocks should share a common area of memory, each one's sink being the next one's source.

So that's the general idea. Now I'm going to present some specifics of the Allwinner platforms, which is what I actually worked on. Here are two examples of boards, from our friends at Olimex and Libre Computer; these are the kinds of single-board computers where you find Allwinner SoCs. There are also cheap Chinese tablets that use them, and random products out there, industrial ones included, that might do pretty much anything; it's embedded, so you never know.

The situation on this family of platforms is the following. We have a stateless video decoder that supports a whole bunch of codecs; we only focused on MPEG-2, H.264, and H.265 so far, but of course we hope the other ones will be supported soon. We have a display engine that can take multiple inputs and compose them, which is quite useful for what we want to do. And we also have a GPU, a Mali 400 or 450 on these SoCs. But these platforms also come with weird constraints that got in the way, and this is the root of the problems we had. First off, on the older platforms, there is only a specific region of memory that the VPU can access, so we had to declare a reserved memory area in that region, used both for the final decoded pictures and for the input bitstream, the actual contents of the video. On top of that, they have a weird tiled output format: instead of pixels being laid out line by line, they are laid out differently, which makes life a bit more complicated. Thankfully, the newer platforms don't have those constraints, so we can use any memory. It still needs to be contiguous, because there is no MMU attached to the VPU, but at least there's no restriction on which area we can use, and no tiling that we have to deal with.

To give a visual interpretation of the tiling: on the left is the linear, or raster, scan order, which is the way pixels are usually laid out in memory.
And on the right side, we see what our VPU actually produces. The arrows show increasing memory addresses. You can see that in this case it's not a full line, then another full line, then another full line; it's a small block, then another small block, then another small block. This means that if another device is going to use this data, it needs to understand this representation and be able to deal with it.

To add support for all of this, what we did at Bootlin was first to add support on the display side, and of course we also worked on the VPU driver itself. The VPU driver is called Cedrus; it's currently in staging. It was merged in Linux 5.1 with MPEG-2 support, we added H.264 in 5.3, and H.265 was just merged last week or so, so it will land in 5.5. That's the V4L2 decoder side. On the display side, we had to add a modifier, which is basically an indication that the scan order is not linear or raster, but our weird custom tiled scan order instead. Specifying a modifier is the way to represent this in the DRM subsystem, which deals with display, and we'll see that this modifier turns out to be quite important. Similarly, on the V4L2 side, we introduced a custom pixel format, tiled NV12, which is the way to represent our weird tiled format there. So we basically have everything we need in kernel space: drivers that can handle the constraints we have. We wrote a test utility, v4l2-request-test, that we used to bring up both sides, and we also have a VA-API backend. VA-API is just a way to interact with the video decoding side from user space; the VA-API backend calls directly into the V4L2 kernel driver.

So how did we actually implement this? The first thing we did was to go with a fully bare-metal setup: we decided to just write things from scratch. That's the test utility I mentioned, v4l2-request-test. With it, we implemented all the ioctls to access the V4L2 side for decoding and the DRM side for display, and of course we used the DMABUF API to avoid copies; this is called zero-copy. DMABUF exports an identifier of sorts for the memory area, in the form of a file descriptor, and we just pass that file descriptor between the two APIs, V4L2 and DRM. On the V4L2 side, we create a buffer and provide it to the driver; the driver decodes into a destination buffer, and we call the VIDIOC_EXPBUF ioctl on that destination buffer to get a file descriptor. We can then import that file descriptor into DRM to tell the display engine: this is the memory you should be displaying. On the DRM side, the ioctl is DRM_IOCTL_PRIME_FD_TO_HANDLE, which imports the buffer into the framework.

What we implemented on the DRM side also covers the CSC (color space conversion), the scaling, and the composition, using DRM planes. DRM planes are basically overlays you can put on top of your main framebuffer, your main picture: you add a bunch of pixels onto the final destination, and the display hardware itself composes everything into one image that is finally scanned out to the display, down the HDMI link. This is very useful, because doing all of this in hardware is very efficient, and of course we wanted to use that.
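Putting those two ioctls together, the zero-copy handoff looks roughly like this; a sketch assuming already-open device nodes, with error handling and the surrounding queue/dequeue loop omitted:

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>
#include <xf86drm.h>

/* Export a decoded V4L2 capture buffer as a DMABUF file descriptor,
 * then import that fd into DRM as a GEM handle. video_fd and drm_fd
 * are open device nodes; index is the dequeued buffer's index. */
int export_to_drm(int video_fd, int drm_fd, unsigned int index,
                  uint32_t *handle)
{
    struct v4l2_exportbuffer expbuf = {
        /* Or the _MPLANE variant, depending on the driver. */
        .type = V4L2_BUF_TYPE_VIDEO_CAPTURE,
        .index = index,
        .plane = 0,
    };

    /* Ask V4L2 for a DMABUF fd referring to the decoded picture. */
    if (ioctl(video_fd, VIDIOC_EXPBUF, &expbuf) < 0)
        return -1;

    /* PRIME import: turn the fd into a handle usable on the DRM side
     * when creating a framebuffer for a plane. */
    if (drmPrimeFDToHandle(drm_fd, expbuf.fd, handle) < 0)
        return -1;

    return 0;
}
```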
So that's what we implemented. The resulting pipeline is fairly simple: we take buffers from V4L2 and hand them to DRM, using planes. Things are quite fast, there is no copy, and there is no software- or CPU-based operation. This worked pretty well.

But then we tried to use Xorg, and that's where things started getting more complicated. First off, we went for a GPU-less setup, because we didn't see any particular reason to involve the GPU at first. The idea was: can we use a DRM plane under X? Can we ask the display hardware to do that composition through the X API? The truth is that X has lots of limitations in that regard. The core X protocol only knows about RGB formats, and as I mentioned, our decoded frames are YUV, not RGB, so that didn't fit well. There are extensions to allow it, though. The first one is quite old; it's called Xv, and it supports YUV inputs. The problem is, first, that you have to implement it in the hardware-specific part of X, called the DDX, for device-dependent X: you have to write a device-specific implementation just to pick up the YUV data and push it to a DRM plane, so that meant some work. It also doesn't support any zero-copy mechanism, so you have to copy the VPU's destination buffer into another buffer that you provide to Xv. And another problem is that it has serious synchronization issues: if you use Xv, say with VLC or any player, and you move the window around, you'll see the video contents lagging behind rather than following pixel-perfect, by a tenth of a second or even more. It's really annoying, and it doesn't work well.

Then there is DRI3, a fairly new extension to Xorg that is supposed to cover some of X's historical downsides in this area. One thing it does is DMABUF import: we can pass the DMABUF file descriptor to that API and hope the X server will import it properly, with zero-copy, and display it. But DRI3 didn't support modifiers, so for the older-generation platforms, where we have to specify the modifier because the scan order is not linear but our tiled format, it just couldn't work out. There's also the fact that DRI3 is currently only supported by Glamor, the GPU-accelerated X backend, so if we're not using a GPU, we can't use this API at all. And even then, it doesn't give us access to DRM planes: it only allows importing the buffer with zero-copy into the GPU, not handing it to the display engine. So that wasn't really usable either.

In the end, for X, we resorted to a fully software-based pipeline, which wasn't great. We still needed to do untiling to provide the buffer to X, so we had to bring in an implementation of that operation, putting the pixels back in the expected order. We did it with NEON, ARM's SIMD instruction set, which is much faster than a plain CPU implementation. But we also had to take care of the color space conversion, the scaling, and the composition with the rest of the display contents, and all of that involved copies too, so we couldn't get a zero-copy setup. This led to a very suboptimal pipeline.
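To give an idea of what that untiling step does, here is a plain-C sketch, assuming 32x32-byte tiles stored one after the other in raster order; the actual Allwinner layout has more details, and our implementation used NEON rather than this scalar loop:

```c
#include <stdint.h>
#include <string.h>

#define TILE_W 32
#define TILE_H 32

/* Copy one tiled plane into a linear destination buffer.
 * width and height are assumed to be tile-aligned. */
void untile_plane(const uint8_t *src, uint8_t *dst,
                  unsigned int width, unsigned int height)
{
    unsigned int tiles_per_row = width / TILE_W;

    for (unsigned int y = 0; y < height; y++) {
        unsigned int tile_y = y / TILE_H;

        for (unsigned int tx = 0; tx < tiles_per_row; tx++) {
            /* Offset of this 32-byte run inside the tiled buffer:
             * whole tiles before ours, plus our row within the tile. */
            size_t off = ((size_t)(tile_y * tiles_per_row + tx) *
                          TILE_W * TILE_H) + (y % TILE_H) * TILE_W;

            memcpy(dst + (size_t)y * width + tx * TILE_W,
                   src + off, TILE_W);
        }
    }
}
```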
You can see the pipeline here. The VA-API backend is where we did the untiling step, fully in software. That was still surprisingly good: we managed to play 720p videos at 30 frames per second with purely software composition. The real bottleneck was scaling; as soon as scaling was involved, performance dropped to 10 frames per second or less, which is really not good enough for watching a video.

So then we tried to bring the GPU into the mix, because, hey, it's supposed to be able to do anything, so it might as well do this. Using the GPU, in our case, means using a non-free blob that ARM provides, and the way to integrate it with Xorg is a device-dependent X driver called armsoc. That one doesn't actually accelerate composition, so that's one thing it won't help with, but it still lets us use the GPU with OpenGL to accelerate operations. First, I tried to import the decoded buffer into the GPU using a YUV import, and then do the untiling on the GPU. That didn't work out, because the non-free blob doesn't support YUV import. No luck. Then I tried importing the data as 8-bit components, reading each byte separately and doing the computation from that on the GPU. It worked well on Intel; I developed the whole shader on Intel and it was fine. But when I moved it to the Mali, it didn't work, for a reason I still don't fully understand. I found bug reports from people having the same issue, but nobody really knows what's going on. And zero-copy was also impossible for that 8-bit direct access. So the bottom line is that we didn't manage to use the GPU for anything, and it was basically a waste of time. Maybe if it hadn't been a blob but a free software implementation instead, we could have dug into it and seen whether there are actual hardware limitations, or just missing documentation or implementation; that could have helped. Nowadays we have a free driver for the Mali, called Lima. It's still somewhat experimental, but it's making good progress, so maybe things would work better with it. We hadn't checked again by the time Lima got merged.

Another thing is Wayland. Wayland is awesome, everyone knows that; if you don't already use Wayland, please do, it's the future. We didn't really have time to investigate it during this project, but I still wanted to dig around and see whether it would be feasible. Wayland has a somewhat conflicted relationship with DRM planes: it doesn't expose them to applications, so an application doesn't know that the system has a plane available for composition. If you have a video player, it cannot say: I want to use a plane for this. Instead, planes are considered an implementation detail of the Wayland compositor. The compositor might end up using a plane if it can and if things work out, but there is no guarantee. That's something to keep in mind. As for zero-copy, there is a Wayland protocol extension that supports it, called linux-dmabuf, with a dedicated interface, and that one does support modifiers. So it's fine for our older-generation hardware, but the modifier information still needs to be passed all the way down to the display engine.
But it turns out that, the way it's implemented, the rendering backend is GPU-based if you want to use DMABUF, so the GPU also needs to know about the modifier and support it, which was definitely not the case for us: the GPU blobs we have don't support the modifier, and they don't support DMABUF import for it. So odds are it wouldn't have worked with Wayland for us either.

Now, the final pipeline we worked on, the one that actually worked and gave nice results, uses Kodi. For those of you who don't know Kodi, it's a media center implementation, a really nice one, fully free software as far as I know. It uses OpenGL for UI rendering, through the GBM/EGL backend. If you're familiar with the OpenGL bits, EGL is the window system integration layer for OpenGL, and GBM basically allows using DRM directly for that, which means you don't need a display server. And that's nice, because the application, Kodi, is then directly in touch with the DRM subsystem, so it can actually use planes: they are not hidden away by a display server, they're directly usable. So what worked was simply using DRM planes from Kodi. There was already fair support for that; I had to tweak a few bits to really get it to work, and also integrate our VA-API backend to retrieve the frames and do the zero-copy export. But in the end it worked out, and we got Kodi up and running with our hardware, our modifier, everything we needed. That's the only non-test-based pipeline we got working.

So it's almost time for the end of this talk. Some general takeaways, or at least things I've noticed and wanted to share. First, don't expect DRM planes to be made available to your application. At best they will be available to the Wayland compositor, but the compositor is not going to expose them in any way, so don't expect to use them directly from user space. I'm not going to say whether that's a good or bad decision; it certainly makes sense in the context of Wayland, but that's the way it is. Also, modifier support is something pretty new in user space: only the Wayland linux-dmabuf extension has a notion of modifiers, and basically any Xorg extension you'll find doesn't care about them. So if your framebuffer is tiled, you're basically screwed. Also note that most of the user space stack nowadays expects OpenGL to be available for everything, but in our case using the GPU was more a source of issues and headaches than something that solved problems: partly because it was a non-free, proprietary blob, partly because it didn't have a proper DMABUF import implementation, and because of that weird limitation where importing the decoded frame just didn't work.

Back to planes: there are some projects that try to make using planes much easier. There's one called libliftoff; it's still under discussion, it's a very new thing, and I think Simon Ser is the person working on it. So if you're interested in exploiting planes and having a library that makes that easy, you can get in touch with that project. More generally, there is the liboutput project, which aims to make display server implementations more robust by sharing a common code base, and it would also help with using DRM planes in the context of compositors and things like that.
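As a concrete illustration of what "using DRM planes" with a modifier involves when an application does have direct DRM access, like Kodi here, a rough sketch follows. The GEM handle comes from a PRIME import like the one shown earlier; plane_id and crtc_id are assumed to have been picked from the DRM resources, and the NV12 planes are assumed packed into a single buffer:

```c
#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>
#include <drm_fourcc.h>

/* Wrap a tiled NV12 buffer in a DRM framebuffer and scan it out on a
 * plane, letting the display engine do the CSC and scaling. */
int show_frame(int drm_fd, uint32_t handle, uint32_t plane_id,
               uint32_t crtc_id, uint32_t width, uint32_t height)
{
    /* NV12: plane 0 is Y, plane 1 is interleaved UV at half height. */
    uint32_t handles[4] = { handle, handle };
    uint32_t pitches[4] = { width, width };
    uint32_t offsets[4] = { 0, width * height };
    uint64_t modifiers[4] = { DRM_FORMAT_MOD_ALLWINNER_TILED,
                              DRM_FORMAT_MOD_ALLWINNER_TILED };
    uint32_t fb_id;

    /* The modifier is how the display driver learns the buffer is
     * tiled rather than linear. */
    if (drmModeAddFB2WithModifiers(drm_fd, width, height,
                                   DRM_FORMAT_NV12, handles, pitches,
                                   offsets, modifiers, &fb_id,
                                   DRM_MODE_FB_MODIFIERS) < 0)
        return -1;

    /* Display at native size here; source coordinates are in 16.16
     * fixed point, and changing the crtc_w/h would make the hardware
     * scale. */
    return drmModeSetPlane(drm_fd, plane_id, crtc_id, fb_id, 0,
                           0, 0, width, height,
                           0, 0, width << 16, height << 16);
}
```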
So that's it for this presentation. I think we still have a few minutes for questions, so if you have any, feel free. Otherwise, thank you.

No questions? Well, cheers. Ah, there's one: does the SoC have some sort of hardware scaler unit, so you wouldn't have to do the scaling in software? Yes, but it's integrated into the display engine. When we use DRM planes, we get the opportunity to use the scaler that comes with them; that's exactly why we wanted to use DRM planes as much as possible.

A follow-up: there are SoCs that have dedicated scalers; did you ever look into that, and is it supported by Linux? Yes, some SoCs have dedicated scalers. One example that comes to mind is the Rockchip RGA, which is exposed through V4L2 as a memory-to-memory device; it actually uses the same API as the one we have for codecs. You can give it any framebuffer as input, tell it what to scale to, and you get the scaled result as output. So that does exist, but I'm not aware of many projects leveraging it from user space: the kernel side is there, but I don't know if it's used a lot. Ah, it is used in GStreamer, OK. And it also exists on Exynos and i.MX6. So yes, that does exist. Nice, thank you.

Anyone else? I don't bite, don't be shy. Yes, we have one over there. The question: from the questioner's slightly limited experience, a lot of these chipsets were actually designed for Android, and the way you're expected to use them there is essentially through a GL extension, I think it's called external image, which lets you use the video surface as a texture. Is that something you looked at: basically taking the memory from the video decoder and piping it straight into a GL texture that you can then display like any other GL texture? That's exactly the thing I tried to do and failed at doing. It didn't work out, for unknown, basically blob-related reasons. I'm sure that's how it was supposed to be used, but there we are. Thanks for that.

One more addition from the audience: if you use some of the free software GPU drivers, you can directly import DMABUFs into the GPU and get a GL texture out of them. Yes, but that only works if the modifier is also supported by the GPU, which in our case was just not the case: the modifier was only supported by the VPU and the DRM display side, not by the GPU. That's another reason why the GPU didn't help solve the pipeline issue.

OK, I think we're out of time, so thanks.
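For reference, the EGL external-image import path discussed in that last exchange looks roughly like this; a sketch assuming an NV12 buffer in a single DMABUF and an initialized EGLDisplay. This is the path that failed on the Mali blob, which rejected YUV imports and knew nothing about the tiling modifier:

```c
#define EGL_EGLEXT_PROTOTYPES
#define GL_GLEXT_PROTOTYPES
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>
#include <drm_fourcc.h>

/* Wrap a DMABUF fd in an EGLImage and bind it as an external texture.
 * fd, width, height and pitch describe the decoded NV12 buffer, with
 * the UV plane assumed to follow the Y plane in the same buffer. */
GLuint import_dmabuf_texture(EGLDisplay display, int fd,
                             uint32_t width, uint32_t height,
                             uint32_t pitch)
{
    const EGLint attribs[] = {
        EGL_WIDTH, (EGLint)width,
        EGL_HEIGHT, (EGLint)height,
        EGL_LINUX_DRM_FOURCC_EXT, DRM_FORMAT_NV12,
        EGL_DMA_BUF_PLANE0_FD_EXT, fd,
        EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
        EGL_DMA_BUF_PLANE0_PITCH_EXT, (EGLint)pitch,
        EGL_DMA_BUF_PLANE1_FD_EXT, fd,
        EGL_DMA_BUF_PLANE1_OFFSET_EXT, (EGLint)(pitch * height),
        EGL_DMA_BUF_PLANE1_PITCH_EXT, (EGLint)pitch,
        EGL_NONE,
    };
    GLuint tex;

    EGLImageKHR image = eglCreateImageKHR(display, EGL_NO_CONTEXT,
                                          EGL_LINUX_DMA_BUF_EXT, NULL,
                                          attribs);
    if (image == EGL_NO_IMAGE_KHR)
        return 0;

    /* GL_OES_EGL_image_external lets shaders sample the YUV buffer as
     * a regular texture, with the driver doing the conversion. */
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_EXTERNAL_OES, tex);
    glEGLImageTargetTexture2DOES(GL_TEXTURE_EXTERNAL_OES, image);

    return tex;
}
```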