Hi everyone, I'm Maxime Ripard from Bootlin, and I've been working on a video codec driver to do hardware video decoding on an Allwinner platform, so an ARM SoC. I'm going to talk a bit about how we managed to make it work and how it's going to work in the next kernel releases, because it's something that is brand new. First, let me introduce myself. I've been an embedded engineer at Bootlin for the last seven and a half years. We do embedded Linux development and embedded Linux training, and I've been contributing a lot as part of this job. I'm the co-maintainer of the Allwinner SoC support in Linux, for exactly six years now, since around ELC 2012. For a couple of weeks I've also been a co-maintainer of the drm-misc subsystem, which is basically the subsystem responsible for the graphics stack in Linux, the "misc" part being basically everything that is not NVIDIA, Intel, or AMD. And I've been contributing to a lot of open source projects over time: Buildroot, U-Boot, Barebox, and so on.

So let's get started on video decoding. First, we need to talk a bit about what an encoded video is, and it's basically two things. The file itself is not the video itself; it is first a container. The container is the format used to organize, within that file, the video streams, the audio streams, and some other metadata, for example the subtitles. So it's basically how the file is laid out to store all the streams and data you expect when you are playing the video. And then you have one or multiple codecs, if you have both video and audio, which encode the data in a compressed format, so that you get a decent file size and don't blow up your hard drive just because you have two or three video files on it.

We are only going to talk about the codec itself, not about containers, because as far as video hardware accelerators are concerned, containers really don't matter: it's usually the video player, or one of its libraries, that extracts the video stream from the file and passes it to the decoder. So yes, we are only going to talk about the codec. And the codec basically relies on something called the bitstream. The Wikipedia definition for it isn't very helpful: the bitstream, if you look at Wikipedia, is basically a sequence of bits. Great. But in the context of codecs, it is the compressed output of the encoder, so it's the encoded video data sitting on your hard disk when you have downloaded or encoded a video. It is mainly composed of three things. We are talking in general terms here; some more advanced codecs have more things in their bitstream, but you will basically always find: a separator between frames, so that just by reading that stream of bits you can tell when a frame starts and when it stops; the metadata, which holds the compression parameters (for example, one of these parameters might be the list of images that were used as references to compress the current image); and finally things called slices, which are the compressed output itself, so basically the data.
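To make the "separator" idea concrete, here is a minimal sketch (mine, not something from the talk) of how a parser could split an Annex B style H.264 or MPEG bitstream on its 0x000001 start codes; the function names and the simplified handling (no 4-byte start codes, no emulation-prevention bytes) are purely illustrative.

    #include <stddef.h>
    #include <stdio.h>

    /* Return the offset of the next 00 00 01 start code at or after 'pos',
     * or 'len' if there is none.  Real parsers also handle the 4-byte
     * 00 00 00 01 form and emulation-prevention bytes. */
    static size_t next_start_code(const unsigned char *buf, size_t len, size_t pos)
    {
        for (size_t i = pos; i + 3 <= len; i++) {
            if (buf[i] == 0x00 && buf[i + 1] == 0x00 && buf[i + 2] == 0x01)
                return i;
        }
        return len;
    }

    /* Walk the bitstream and print the byte range of every unit (metadata or
     * slice) found between two separators. */
    void split_units(const unsigned char *buf, size_t len)
    {
        size_t start = next_start_code(buf, len, 0);

        while (start < len) {
            size_t end = next_start_code(buf, len, start + 3);
            printf("unit: bytes %zu..%zu\n", start, end);
            start = end;
        }
    }

Splitting on these separators is the very first thing any decoding stack has to do with the stream.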
A decoder will usually look like this. You have your video file here on the upper left, and within the container you see the video bitstream, with each slice holding raw data, separated by those separators. Decoding is actually a multi-step process. First you take that raw data and give it to something called the bitstream parser, which extracts the metadata and the slices from the data stored in the file. Once you have the metadata and the slices, you give them to two different units. The first one is the controller, which takes the metadata and controls what the decoder is actually doing. The decoder also takes the slices as input, so that, using the combination of the metadata and the slice itself, it can produce a decompressed frame that you can look at and hopefully display, or do whatever you want with it. In most codecs that decoded frame is also used as an input for subsequent frames, to achieve decent compression, so the decoded frame is an input for the next frames to decode as well.

When you look at hardware decoders, the ones found in SoCs or GPUs have, most of the time, or at least in the past, been based on a design called stateful codecs. Stateful codecs are actually quite nice, at least from a programming-model point of view, because you just take that raw data I was telling you about, give it to the codec, and the codec has all the units it needs to do the job in hardware. You just give it the raw data and get back a decoded output without any further intervention on your part, which is very nice from a programming-model point of view because it's quite simple. But from what I heard it's actually pretty difficult to get right in hardware: most of the time the hardware is a bit more complicated, you need firmware that you have to develop, and so on. So it has been replaced by a newer design called stateless codecs, which we are going to talk about later on.

If we want to support stateful codecs in Linux, the API and framework to use is V4L2, Video4Linux 2. V4L2 was introduced quite a while ago, in 2002, and as its name suggests it supports basically everything related to video in Linux. It's not only about codecs: it's also about cameras, DVB receivers, that kind of device, so everything that produces or consumes video in Linux. V4L2 uses a sub-framework called M2M, for memory-to-memory, in order to support these stateful codecs. If you look at a very dumb pipeline using V4L2, you have an application that feeds the V4L2 driver the raw data buffers we were talking about, and gets back the decoded pictures as soon as the decompression is done. That part is pretty simple. The only thing that is a bit unintuitive, at least in my opinion, is the nomenclature of the queues, and whether "output" and "capture" actually mean output and capture: they are named from the point of view of the user-space application, not of the system itself. So, for example, the raw data that you push into the codec for decoding goes through the output interface, and you get your decoded frame back through the capture interface, which is kind of weird to me, but anyway.
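As a rough illustration of that M2M model (my own sketch, not code shown in the talk), a stateful decode loop in user space mostly alternates between queueing compressed buffers on the OUTPUT queue and dequeueing decoded pictures from the CAPTURE queue; the setup steps (VIDIOC_REQBUFS, format negotiation, mmap) and the multi-planar buffer types many drivers use are left out.

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/videodev2.h>

    /* Push one chunk of compressed bitstream to the decoder and pull back
     * one decoded picture.  Returns the index of the capture buffer that
     * holds the picture, or -1 on error. */
    int decode_one(int fd, unsigned int index, unsigned int bytes)
    {
        struct v4l2_buffer buf;

        memset(&buf, 0, sizeof(buf));
        buf.type = V4L2_BUF_TYPE_VIDEO_OUTPUT;     /* compressed data goes in */
        buf.memory = V4L2_MEMORY_MMAP;
        buf.index = index;
        buf.bytesused = bytes;
        if (ioctl(fd, VIDIOC_QBUF, &buf) < 0)
            return -1;

        memset(&buf, 0, sizeof(buf));
        buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;    /* decoded picture comes out */
        buf.memory = V4L2_MEMORY_MMAP;
        if (ioctl(fd, VIDIOC_DQBUF, &buf) < 0)
            return -1;

        return buf.index;
    }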
So if we wanted to support a stateful codec, we would have something like this: the application, possibly through libraries and frameworks, takes the bitstream, splits it into frames at each separator, feeds the data to the stateful codec, and gets back the decoded frames. Everything is fine. And this actually works quite well; basically every codec supported in Linux these days is a stateful codec. For example, BayLibre has been doing some great work on the Amlogic SoCs recently; they gave a great talk about their video codec work at Embedded Recipes a few weeks ago. The video codec on the Amlogic SoCs is a stateful codec, and you have plenty of support in user space, libraries, frameworks and so on, to be able to use it. Everything works great, except when you start to introduce a stateless codec, which only has the decoder part in hardware. Everything else, so the bitstream parsing, the controller, and keeping the decoded frames around so the decoder can do its job, has to be done somewhere else. The design decision was that the bitstream parsing in particular had to be done in user space, because the bitstream is basically a file coming from somewhere else that you cannot trust. Parsing that file in the kernel is, first, quite difficult, and then you get all kinds of security issues that would be hard to prevent. So the decision was made to do the bitstream parsing in user space, and you have to split everything apart.

It's especially tricky with V4L2, because between the controller and the decoder you have to pass all those controls to tune the decoder so that it can decode the frame properly. There was already an API for that in V4L2, a set of ioctls able to do exactly that operation, except that it was completely separate from the buffers themselves: you were able to change the controls, but you were not able to synchronize that change with a particular buffer. And in this case you have to change the controls in lockstep with each buffer so that the decoder can do its job properly. So there has been work for quite some time on something that was, and still is, called the request API, which is basically an API that lets you combine the two and have those control changes applied in lockstep with the buffers themselves. The first RFC was sent in 2015, and then it hopped between a few people, because it is also something that could be used for cameras. For cameras, changing the exposure shows basically the same issue: when you change the exposure while capturing frames, you have no way, at least at the moment, to tell at which frame exactly the exposure change took effect, which is an issue in some use cases as well. So the API went back and forth between the codec use case and the camera use case over the years, with a bunch of people stepping in to help and trying to address it for their particular use case.
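To make "controls in lockstep with buffers" concrete, here is my own minimal sketch of what a per-frame submission looks like with the request API as it was eventually merged (this is not code from the talk, and all error unwinding is omitted): the decoder allocates a request from the media device, attaches the per-frame codec controls and the bitstream buffer to it, and then queues the request as a whole.

    #include <sys/ioctl.h>
    #include <linux/media.h>
    #include <linux/videodev2.h>

    /* Queue one frame worth of work through the request API.
     * 'media_fd' is the media controller device, 'video_fd' the V4L2 decoder,
     * 'ctrls' holds the parsed per-frame codec controls, and 'out_buf' is the
     * OUTPUT buffer containing the slice data for this frame. */
    int queue_frame(int media_fd, int video_fd,
                    struct v4l2_ext_controls *ctrls,
                    struct v4l2_buffer *out_buf)
    {
        int req_fd;

        /* 1. Allocate a request; it comes back as a file descriptor. */
        if (ioctl(media_fd, MEDIA_IOC_REQUEST_ALLOC, &req_fd) < 0)
            return -1;

        /* 2. Set the codec controls in the request, not on the device. */
        ctrls->which = V4L2_CTRL_WHICH_REQUEST_VAL;
        ctrls->request_fd = req_fd;
        if (ioctl(video_fd, VIDIOC_S_EXT_CTRLS, ctrls) < 0)
            return -1;

        /* 3. Queue the bitstream buffer as part of the same request. */
        out_buf->flags |= V4L2_BUF_FLAG_REQUEST_FD;
        out_buf->request_fd = req_fd;
        if (ioctl(video_fd, VIDIOC_QBUF, out_buf) < 0)
            return -1;

        /* 4. Queue the request itself: the controls and the buffer are now
         * applied together when the driver processes this frame. */
        return ioctl(req_fd, MEDIA_REQUEST_IOC_QUEUE);
    }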
And the codec side, at least, was finally merged in what will become 4.20, or whatever it ends up being called. So at least now we have support for stateless codecs, or at least an API that allows us to support stateless codecs in V4L2, which is great. If we go back to the stack we were discussing before, it now looks like this: you still have your container, your video bitstream, and your raw data, which is parsed by your application; the application then feeds the slices, and modifies the controls based on what comes out of the slices and metadata, to your V4L2 driver, which in turn gives you a decoded frame that you keep around so you can use it as a reference for later frames, and so on. So only the decoding part is now in the kernel, and everything else is in user space.

We've been doing this work mostly on Allwinner SoCs. Allwinner produces multimedia SoCs that are mostly targeted at tablets and set-top boxes; they are the SoCs you are very likely to find in those cheap SBCs that go for less than $10 on Alibaba, for example. Just like any multimedia SoC, they have a hardware unit to decode and encode video, except that it is a stateless codec, so we needed that particular design to be able to use it. Allwinner, like all the ARM SoC vendors, provides a BSP stack, and in Allwinner's case it's kind of an outdated one: the kernel they ship is either 3.4 or 3.10, even these days, which is pretty old. For their hardware codec they are not using V4L2; they basically have some kind of private API, based on a stack that is split in two parts. There is a kernel driver that is basically just there to manage the resources: handle the clocks, get the interrupts, memory-map the registers of the unit, and so on. All the logic is actually in user space, and that part was closed source for quite a while, so we had basically no idea how the hardware actually worked. That kind of design would not fly with mainline kernels, so we were stuck for quite some time.

Except that there has been a reverse-engineering effort on this particular driver. The hardware unit is called Cedar, like the tree, so the reverse-engineering project was called Cedrus. It was an effort to build an open-source stack able to drive that particular hardware unit, and it covered basically all the meaningful video codecs for decoding, and only H.264 for encoding, not because they didn't have the time to do more, but because the unit is only able to encode H.264. So we basically had most of the codecs and features figured out, but it was mostly targeted at the Allwinner BSP: it was just a way to replace the closed-source decoding part I was telling you about, and it still relied on the same API that Allwinner exposed through that small kernel driver. It was completely functional; some SBC vendors even shipped this reverse-engineered project as part of the BSP they give to their customers. But since that kernel was quite outdated and not really maintained any more, it wasn't really a way forward, just a stop-gap measure, and we were not able to use it in mainline at all.
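To picture how thin that BSP kernel driver is, here is a rough sketch of a driver shaped like it (purely illustrative, using generic mainline helpers; this is not the actual Allwinner code): it only claims the clock, the registers and the interrupt, and leaves every bit of decoding logic to user space behind a private interface.

    #include <linux/clk.h>
    #include <linux/err.h>
    #include <linux/interrupt.h>
    #include <linux/io.h>
    #include <linux/module.h>
    #include <linux/platform_device.h>

    /* "Resource only" driver: clocks, registers, interrupt.  In the BSP
     * design, everything else lives in a closed user-space library talking
     * to a driver of roughly this shape through private ioctls. */

    static irqreturn_t ve_irq(int irq, void *data)
    {
        /* Mostly just signals the user-space library that a job finished. */
        return IRQ_HANDLED;
    }

    static int ve_probe(struct platform_device *pdev)
    {
        void __iomem *regs;
        struct clk *clk;
        int irq;

        regs = devm_platform_ioremap_resource(pdev, 0);
        if (IS_ERR(regs))
            return PTR_ERR(regs);

        clk = devm_clk_get(&pdev->dev, NULL);   /* enabled when a job runs */
        if (IS_ERR(clk))
            return PTR_ERR(clk);

        irq = platform_get_irq(pdev, 0);
        if (irq < 0)
            return irq;

        return devm_request_irq(&pdev->dev, irq, ve_irq, 0, "video-engine", NULL);
    }

    static struct platform_driver ve_driver = {
        .probe = ve_probe,
        .driver = { .name = "video-engine" },
    };
    module_platform_driver(ve_driver);
    MODULE_LICENSE("GPL");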
The Cedrus project was also providing a libvdpau implementation, so you could use it with popular media players, libvdpau being one of the APIs for providing decoding capabilities to regular media players. So you could use VLC and so on with that work. It was a great reverse-engineering effort, but not quite enough for us.

In the summer of 2016 we had an intern who started working on that kind of driver and sent an RFC, and considering it was only about two months, it was actually very successful. By the end of those two months he had MPEG-2 decoding working, and a libVA implementation to integrate with the popular media players. I was telling you about libvdpau: libvdpau is basically the standard pushed by NVIDIA, I think, while libVA is mostly pushed by Intel, and the libVA API actually worked better for us, so we chose to go with libVA. So we had basic support for MPEG-2 decoding and some integration into the popular media players, but it was still really a prototype. We definitely still had bugs; for example, we had frames coming out backwards: you had one frame, and the frame that was supposed to come before it actually came after, so it looked like weird glitches. That's because in video codecs the encoding order is different from the display order, so that you can compress more efficiently: a B-frame, for instance, is encoded after the future frame it references even though it is displayed before it. You have to take that into account, and we didn't at the time, which led to some interesting bugs, or at least artifacts. But the main issue was that it was really slow: we could only play videos that matched the resolution of the screen, otherwise it didn't really work, and that was actually on the display side. And who cares about MPEG-2? Everyone is playing H.264 and H.265 videos these days. So it was a great way to build a prototype and make sure the approach worked, but it was nothing more than a proof of concept.

For a year and a half we had a lot of people coming in, interested in pushing that proof of concept forward, but it was a significant effort, we were not able to do it without funding, and no one really wanted to fund the whole effort on their own. So we finally had the idea to fund it through a Kickstarter campaign, which started at the beginning of 2018, and it worked great. It was the first time we had even tried, or even considered, doing a Kickstarter campaign to fund mainline development, and we reached our goals, even beyond what we were expecting. We ended up committing to write the driver for more SoCs than we initially intended, and we committed to H.264 and H.265 decoding. It allowed us to fund a full-time intern for six months plus a part-time engineer, the full-time intern being Paul Kocialkowski and the part-time engineer being me. We basically built on top of the prototype to support more SoCs, reduce the number of bugs, obviously, and try to work on the slowness issues we were seeing.

So here is what the Cedrus stack looks like. We have the bitstream parser, which is actually part of the video players or video frameworks, so things like VLC, FFmpeg, and so on.
We get from the video player the metadata and slices that have already been parsed, and they are given to our VA-API implementation, which we called libva-v4l2-request. That name is because the request API is pretty generic and the libVA API is pretty generic as well, so it can be made to work with any stateless codec we need to support. It's a generic piece of code that has been tested on a single SoC so far, so it's probably not generic enough yet, but it should be. And then we have a V4L2 driver written for our particular hardware, the Cedrus driver. The video decoding itself actually works pretty well; what we found the most difficult was actually displaying the decoded frame. In an ideal world, and it's basically what Kodi does, you would have a simple pipeline: our V4L2 driver, with the libVA implementation driving the Cedrus decoder, so giving it the bitstream and getting back the decoded frames, and then Kodi would take that decoded frame, give it to the KMS drivers we have in the kernel, and ta-da, everything is displayed. Except that the decoded frames come out in a proprietary format, which is not that difficult to guess: it's a tiled YUV format. Most hardware engines actually output some variant of a tiled format, simply because it's more efficient for them to work that way. So they basically all do it, but the exact tiling, so the tile dimensions in both directions, is always different from one vendor to another, and you have to figure it out. Fortunately for us, the display engine is able to scan out that proprietary format without any conversion, so we can just take the video decoder output, give it to the display engine, and everything works. Another issue is scaling, so that you can upscale or downscale the video to match your screen: that isn't easily doable either. We could, once again, just use the display engine, which has a hardware scaler, so we could upscale or downscale without any performance hit, since it's done entirely in hardware. But X11 doesn't let you do that easily, so we basically hit a wall there: X11 pretty much expects that the format is not tiled, and it doesn't have any kind of hardware acceleration for scaling, so it didn't really work for us.

So we tried a number of solutions. The first one was: well, let's convert that tiled format in software. That turns out to be very, very slow. At small resolutions it kind of works, for example at 480p it's good enough, but when you start decoding higher-resolution videos it just doesn't work any more, and it ends up eating all the CPU just to get your video on screen, when the whole point of using hardware decoding was to offload work from the CPU. If you're using just as much CPU anyway, it really isn't worth it any more.
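To show why that software fallback is so costly, here is a rough sketch of what untiling one tiled 8-bit plane into a linear buffer looks like (my own illustration, not the actual code; the 32x32 tile size is an assumption, the real Allwinner tiling differs in its details, and the chroma plane needs the same treatment). Every pixel of every frame goes through the CPU, which is why this cannot keep up at high resolutions.

    #include <stddef.h>
    #include <stdint.h>

    #define TILE_W 32   /* assumed tile width, for illustration only  */
    #define TILE_H 32   /* assumed tile height, for illustration only */

    /* Copy one tiled plane into a linear one, pixel by pixel.  'width' and
     * 'height' are assumed to be multiples of the tile dimensions. */
    void untile_plane(const uint8_t *src, uint8_t *dst,
                      size_t width, size_t height)
    {
        size_t tiles_per_row = width / TILE_W;

        for (size_t y = 0; y < height; y++) {
            for (size_t x = 0; x < width; x++) {
                size_t tile_x = x / TILE_W, in_x = x % TILE_W;
                size_t tile_y = y / TILE_H, in_y = y % TILE_H;
                /* Tiles are laid out one after another, a row of tiles
                 * at a time, and pixels are linear inside each tile. */
                size_t tile = tile_y * tiles_per_row + tile_x;
                size_t off = tile * TILE_W * TILE_H + in_y * TILE_W + in_x;

                dst[y * width + x] = src[off];
            }
        }
    }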
Then X11 has an extension called XVideo, which is meant to accelerate exactly this kind of thing, except that when we started reaching out to VLC developers, for example, they basically stated that it was completely deprecated and that they would remove it in the next release. They didn't, in the end, but it didn't really look like a way forward. We would also have had some glitches, because in X all the composition, so the layout of the various windows, the transparency between them and so on, kind of expects that X owns all the buffers, and XVideo takes one of those buffers out of X's control. So we would have had issues with transparency and so on, which wouldn't really work. So we tried something else: importing the decoded frames into the GPU and doing the untiling and scaling in the GPU itself. Except that we are using a Mali GPU, which isn't supported well enough yet by open-source drivers, and the OpenGL blob has a lot of constraints, including ones that mean it just cannot work for us: we hit some weird bugs when we tried to use a shader to do the untiling, with precision issues on the tile edges and so on. We also have quite low memory bandwidth, so if we can avoid bouncing buffers back and forth between hardware units and main memory, that's good. Then we considered Wayland as well, which is actually supposed to be able to deal with this kind of issue, except that we would obviously leave all of our users running X11 out in the cold, and we would have to patch all the Wayland compositors to support our proprietary format, which doesn't seem ideal either.

So, the current state: the request API has been merged for the next kernel release, thanks to Hans for that. We have a libVA implementation working on top of it that supports MPEG-2, H.264, and H.265 decoding. As part of the driver development we also wrote a tool called v4l2-request-test, which replays a video that we captured, basically a recorded libVA session, over and over again using just that tool and the kernel, which is quite nice when you want to work on the decoding itself and don't care about all the rest of the user-space stack you eventually want to integrate with. And the first part of our Cedrus driver has been merged in 4.20. It only supports MPEG-2 for now, but H.264 and H.265 patches have been sent. It has been merged in staging for now, because it's pretty much the first user of that API and the API is quite new, so we wanted to keep the ability to change the API; if you want to do that, you have to be in staging, otherwise the ABI rules are quite strict. That's why we are in staging. And I'll be doing a demo of this work, I think it's tomorrow, at the technical showcase: we have Kodi running on top of our libVA implementation and the Linux kernel, decoding H.264 video. So if you want to come take a look, please do.

That's it for me. Do you have any questions? There are mics on both sides of the... No questions? Oh, one.

So what's the user-space support like, in external libraries, for this kind of codec? You mentioned libVA, but does FFmpeg have a shortcut, or does it have to go through libVA?

So the end goal is actually to have FFmpeg, GStreamer and others able to use the request API directly to talk to the kernel, and our libVA implementation is basically just a stop-gap measure so that we can have something working now.
But considering that the GStreamer folks have already started some work on it and so on, I don't really expect it to last more than a few years before being completely deprecated, because FFmpeg will be able to talk to the kernel directly. Good, thank you. Okay, there were no more questions, so I guess that's it for me. Thanks for attending.