Hello, welcome to this presentation about Linux codecs. Let me quickly introduce myself. I'm Nicolas Dufresne. I've been working at Collabora for over a decade. I'm a core GStreamer developer, and I contribute from time to time to the Linux media subsystem, as a reviewer of the API and as a power user of the API. And I've been contributing to enhancing the codec support for a couple of years now.

So, Linux codec support isn't a new thing. It started quite a while ago, in fact in 2011. In 2011, Google partnered with Samsung and ASUS to produce the first ARM Chromebook, based on the Exynos 5 SoC. It included the Samsung MFC decoder, which today we would call a stateful decoder. Back then everything went really well: the laptop came out, the drivers and the new subsystem landed in mainline, everything was fully mainline.

So, what is a stateful decoder? We cannot start a discussion about stateless decoders if we don't understand the difference. Consider the stateful decoder as a black box. The black box typically contains a processor, which could also be called a DSP or something else. This processor receives a bitstream, an encoded video stream, processes it and does whatever is needed to feed some accelerator (that's what "ACC" means) in order to produce images. It's quite straightforward.

Now, when Google came along, there was basically nothing in the Linux kernel that already supported this model where you feed memory in on one side and get memory out on the other. So, in order to do that, they extended the V4L2 node to support two queues. They added the OUTPUT queue, which is the actual input of the memory-to-memory device; I know it's very confusing, but they had to live with the legacy of the V4L2 API. The OUTPUT queue receives the bitstream. They also added a CAPTURE queue, a bit like a camera capture queue, and this one produces the decoded images. On top of that, they added support for draining and flushing, things that are needed to seek inside your stream or to end your stream.

It's a great, simple technology, very minimal. To implement this in user space, you need a very minimal understanding of the codecs, because you just pass the frames you receive from the demuxer, basically from another piece of software, to the hardware, and you get images back in the right order. The downside is that pretty much all of this hardware requires a firmware, because you need a program to run on that processor, which is often a custom architecture. So it's a proprietary firmware built with a proprietary compiler; it's not very open-friendly. And it can be very hard to multiplex: your ability to run multiple streams depends on the API implemented on that processor and on the memory available to it. So it has some limitations, but it's very quick to integrate.

The story of stateful decoders and encoders continued. In 2014, we saw the CODA driver being merged. It was developed by Philipp Zabel at Pengutronix to support the Chips&Media CODA chipsets, the CODA960 initially, and then the CodaHx4 found on the i.MX51. So it gives you i.MX6 and i.MX51 support in mainline, for both decoders and encoders. This effort was all reverse engineered from the NXP binary blobs and some documentation found here and there. Now, the story continued. Everything was good, everybody was happy with that.
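To make the two-queue model above a bit more concrete, here is a minimal sketch of how user space might set up a stateful V4L2 decoder. The device path, buffer counts and sizes are assumptions for illustration only; a real player would also subscribe to the source-change event, handle resolution changes and implement draining.

```c
/* Minimal sketch of a stateful V4L2 decoder setup: two queues on one
 * memory-to-memory node. Paths, sizes and counts are illustrative. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

int setup_stateful_decoder(const char *path)
{
    int fd = open(path, O_RDWR);            /* e.g. "/dev/video0" (assumed) */
    if (fd < 0)
        return -1;

    /* OUTPUT queue: despite the name, this is the *input* of the M2M
     * device, it receives the encoded bitstream. */
    struct v4l2_format out_fmt = {0};
    out_fmt.type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE;
    out_fmt.fmt.pix_mp.pixelformat = V4L2_PIX_FMT_H264;
    out_fmt.fmt.pix_mp.num_planes = 1;
    out_fmt.fmt.pix_mp.plane_fmt[0].sizeimage = 1 << 20;  /* 1 MiB chunks */
    if (ioctl(fd, VIDIOC_S_FMT, &out_fmt) < 0)
        return -1;

    /* CAPTURE queue: produces the decoded images, like a camera would.
     * For a stateful decoder the real capture format only settles once
     * the firmware has parsed the stream headers, so this is a default. */
    struct v4l2_format cap_fmt = {0};
    cap_fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE;
    if (ioctl(fd, VIDIOC_G_FMT, &cap_fmt) < 0)
        return -1;

    /* Allocate buffers on both queues; bitstream chunks are then queued
     * on OUTPUT and decoded frames dequeued from CAPTURE in display order. */
    struct v4l2_requestbuffers reqbufs = {0};
    reqbufs.count = 4;
    reqbufs.type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE;
    reqbufs.memory = V4L2_MEMORY_MMAP;
    if (ioctl(fd, VIDIOC_REQBUFS, &reqbufs) < 0)
        return -1;

    reqbufs.count = 8;
    reqbufs.type = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE;
    if (ioctl(fd, VIDIOC_REQBUFS, &reqbufs) < 0)
        return -1;

    return fd;
}
```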
Until 2015. In 2015, Google started to partner with Rockchip in order to build a second generation of ARM Chromebooks. But these SoCs no longer had a processor in front of their codec. Instead, they had what we call a stateless decoder; it was the beginning of stateless, assumed to be a Rockchip design back then. Unlike the previous Chromebook, it didn't go as well on the upstreaming side. Even though Google made a great effort to produce upstreamable code, it didn't get in right away. So that's where they were. Oops, wrong button, sorry.

Now, to continue the analogy: a stateless decoder doesn't have a processor. It basically exposes the accelerators directly, each with a certain set of parameters. And for user space to drive these accelerators, you no longer just pass the bitstream. In fact, you pass a subset of the bitstream, you have to pass the reference frames that are used in the decoding process, and you have to pass a fairly large amount of parameters for that frame or that slice to be decoded, in order to obtain a picture. And the pictures are no longer produced in presentation order, so you may have to do a little more work.

Now, some of you may wonder: how is that different from a GPU decoder? Well, the truth is that it's not different. It's the same model, it's stateless. The difference with your GPU is that the accelerator is exposed through the command stream. You send a command to a coprocessor, and that coprocessor takes the command and fills the registers of your accelerators. So there's an indirection through your GPU. And this command stream is a bitstream that has to be crafted; traditionally, for GPUs on Linux, we use user space to construct that bitstream. So it's different in that sense. And these APIs are exposed through user-space drivers: a VA-API driver, a VDPAU driver, on Windows a DXVA2 driver, or with NVIDIA the NVDEC driver, which is cross-platform. So it's slightly different.

Now, could we have integrated those stateless decoders as if they were GPUs? Of course we could. We would have had to create a generic bitstream for them; that would have been nice. But then we would have had to deal with the fact that we have only one really suitable API, VA-API, which has some limitations, and we would have had to deal with multiple GPUs, which until Vulkan was fairly hard to do. That's not what Google did, and this was 2015, Vulkan was not very big back then. So they decided to extend the stateful decoder model, same model, OUTPUT and CAPTURE queues, and add a set of controls to pass the parameters. Basically, the fact that V4L2 has a queue of buffers and is aware of the buffers allows you to refer to the reference frames symbolically, so you don't need to queue them again; you can put that in your controls.

But then you pass controls, and the semantics of controls in V4L2 is that they apply to the next frame to be decoded, yet we have a queue of frames. So they needed an extra mechanism, and that's where they came up with the request API. The request API is a way to queue parameters, controls and bitstream buffers as a single request, so that they are handled together by the processing. The request API was also meant to be used for cameras; we'll see where that goes, there's nothing implemented yet in this regard. So, it was added to the media controller API.
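To illustrate the symbolic referencing just mentioned, here is a small sketch using the H.264 stateless controls as they were eventually stabilized in the uAPI; the staging headers discussed in this talk used slightly different names, so treat this as an assumption about the final API rather than the exact code of the day. A reference picture is identified by the timestamp of the CAPTURE buffer that holds it, converted to a nanosecond cookie, instead of being re-queued.

```c
/* Sketch: referring to a reference frame symbolically, by the timestamp
 * of the CAPTURE buffer that holds it, in the H.264 decode parameters. */
#include <stdint.h>
#include <sys/time.h>
#include <linux/videodev2.h>
#include <linux/v4l2-controls.h>

static uint64_t timeval_to_ns(const struct timeval *tv)
{
    return (uint64_t)tv->tv_sec * 1000000000ULL +
           (uint64_t)tv->tv_usec * 1000ULL;
}

/* Fill one DPB slot of the decode parameters from a previously decoded
 * CAPTURE buffer that the application kept around as a reference. */
static void set_reference(struct v4l2_ctrl_h264_decode_params *params,
                          unsigned int slot,
                          const struct v4l2_buffer *ref_capture_buf,
                          int32_t top_foc, int32_t bottom_foc)
{
    struct v4l2_h264_dpb_entry *e = &params->dpb[slot];

    e->reference_ts = timeval_to_ns(&ref_capture_buf->timestamp);
    e->top_field_order_cnt = top_foc;
    e->bottom_field_order_cnt = bottom_foc;
    e->flags = V4L2_H264_DPB_ENTRY_FLAG_VALID |
               V4L2_H264_DPB_ENTRY_FLAG_ACTIVE;
}
```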
So, we had to place our VPU, our M2M device, inside a media controller, with the nice addition that the media controller has a topology. There's now a way to describe the functions of your hardware, which makes identification of your memory-to-memory driver much easier. Just to compare: with a stateful decoder, we would look at the formats and try to guess whether this is a decoder or an encoder, a color transform or a scaler. With the topology, we can navigate it and find an entity which has a function, and this function is "decoder", which makes things much simpler.

Now, this wouldn't be a complete understanding of stateless decoding if I didn't give you an example of a format being decoded. I took H.264 because it's fairly complex, but it represents well the work that you have to do in user space in order to handle these codecs. H.264 works with a sequence of NAL units; the unit of transmission is a NAL unit, and these NAL units have different functions. Just a couple of well-known NAL units: the SPS, the sequence parameter set, is a set of parameters that you need to carry along and that you will use for multiple pictures; the PPS, the picture parameter set, carries extra parameters that you will need during decoding, both in user space and by the accelerators. Then you have the slices. The IDR slices are slices that do not refer to other decoded pictures, and neither do the I slices, but I picked the IDR; let's not get into the details of the difference between IDR and I slices. Then you have the P slices, which use previously decoded pictures to decode themselves, and the B slices, which can use pictures that will be presented in the future, or in the past actually, to decode themselves; they have both past and future references. How does that work in practice? Basically, you decode the B slices later than they are presented, which means the decoding order differs from the presentation order. So there's a whole specification and a process to reorder these frames and present them at the right time.

Now, the bitstream comes in two formats. There's the start-code, Annex B format: with start codes, you have three bytes, 00 00 01, a pattern that you can search for inside your binary data in order to find the beginning of a NAL unit and start your parsing. So you can start from any random point in the stream, which is quite suitable for TV streaming over the air, where you join the stream at an arbitrary moment. And there's the AVCC format, where instead of this start code you announce the size of the following slice. So if you want to walk over slices, you can skip by size, which is much faster and more efficient, and this is used for storage, like ISO MP4 or Matroska.

Now, in order to decode this, and this is just a quick overview so you get an idea of the complexity, understand that everything I mention here has a specification, and if you properly follow the recipe, which is of course written in prose and not in code, you'll get the right result. It's a recipe. Basically, you have to locate and parse all the NAL headers, which gives you the type of the NAL units you're dealing with. Then the non-VCL NAL units, the ones that don't carry picture data, have to be processed by user space, so you accumulate that information. And you also need to parse the header part of each slice. With this information, you'll be able to calculate the frame number and handle any gaps.
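As a rough illustration of the two framings described above, here is a hedged sketch in C. It only locates NAL unit boundaries, which is the very first of the parsing steps just listed, and it assumes 4-byte length prefixes for AVCC, which is the common case advertised in the avcC box.

```c
/* Sketch of walking an H.264 stream in its two framings; not a parser,
 * it only finds where each NAL unit starts. */
#include <stddef.h>
#include <stdint.h>

/* Annex B: scan for the 00 00 01 start-code pattern; works from any
 * random point in the stream (broadcast-friendly). */
static const uint8_t *find_start_code(const uint8_t *p, const uint8_t *end)
{
    for (; p + 3 <= end; p++) {
        if (p[0] == 0x00 && p[1] == 0x00 && p[2] == 0x01)
            return p + 3;               /* first byte of the NAL unit */
    }
    return NULL;
}

/* AVCC (ISO MP4 / Matroska): each NAL unit is prefixed by its size, so
 * we can skip from unit to unit without scanning. Assumes 4-byte length
 * fields. Returns the NAL size, or 0 at the end of the buffer. */
static size_t next_nal_avcc(const uint8_t *p, const uint8_t *end,
                            const uint8_t **nal)
{
    if (p + 4 > end)
        return 0;
    size_t size = ((size_t)p[0] << 24) | ((size_t)p[1] << 16) |
                  ((size_t)p[2] << 8) | (size_t)p[3];
    if (p + 4 + size > end)
        return 0;
    *nal = p + 4;
    return size;
}
```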
That's the purpose of the frame number. You also calculate the picture order count, the POC, and the pic_num, which are used in the reordering process and in the referencing process. There's a lot more, but this is just an overview.

Now, some decoders are slice-based, think VA-API or the Cedrus driver, we'll come to that later, and some are frame-based. If they are slice-based, you need to give a lot more information, so you need to prepare the reference lists in advance, as the hardware won't do it for you. At that point, it's the right moment to actually program the hardware. You fill in the SPS, PPS, decode parameters and slice parameters, that's V4L2-specific, and you can also prepare the finalized reference lists, what we call the modified reference lists, or L0 and L1, and pass all that to the hardware. The hardware will then decode the slice. And when that is decoded, remember it's out of order, you do what we call DPB management, decoded picture buffer management, which is another recipe that tells you which buffer is ready to be output to your screen. And that's about it; that's what you have to do in every single piece of software.

On the V4L2 side, though, it's a much simpler API because everything is ordered. You allocate a request, which is a file descriptor. You set the per-frame or per-slice parameters that you associate with that request: when you set the controls, you pass the request FD. You queue the bitstream buffer, again passing the request FD. Then you queue the request itself; that's an API call, queue the request. Then you can wait on that request for the job to be done, and you're good, you can continue with the next decode. It's a bit simpler than the traditional V4L2 approach.

But as I said, this didn't land immediately and time passed. So now we go from 2015 to 2016. There's another stateful decoder, the MediaTek VPU, which got added with a couple of formats, with the particularity that this one only produces tiled formats, which opened a whole new area; you can probably talk with Neil Armstrong about that subject, it's a whole talk on its own, so I'm going to skip this bit. In 2017, even though the driver had been started a while back, the Venus driver for Qualcomm chipsets actually landed, and it landed with both a stateful decoder and a stateful encoder. As of today, it's the most capable driver, with the largest list of supported codecs; pretty much everything you would need is supported there.

But the upstreaming of the stateless side was pretty much stalled. The reason for the stall was normal: people were actively working on it, but they had to settle on the request API, and there were a lot of directions. At some point, the request API was called the job API, then it came back to being the request API, and some details had to be tweaked. And all this was done by maintainers who didn't have a very deep knowledge of codecs; in fact, they had pretty much no experience handling low-level codecs, so they had to learn it, and that's what takes most of the time, getting enough knowledge for the maintainers to be confident that this is the right thing to merge. And there was only one piece of hardware supported, the Rockchip one. To make it worse, there was not yet a formal specification; there was no formal specification for the stateful codecs either, and that got fixed in the meantime, but it was quite a problem, because it was a problem for the consistency and portability of the solution.
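Here is a condensed sketch of that per-frame request flow on the V4L2 side. It assumes the request file descriptor was already allocated from the media device with MEDIA_IOC_REQUEST_ALLOC, and it uses the H.264 control name from the uAPI as it was later stabilized; the control IDs in the staging version discussed in this talk were different, and a real decoder would attach the SPS, PPS and slice parameters as well.

```c
/* Condensed sketch of decoding one frame/slice through the request API. */
#include <poll.h>
#include <sys/ioctl.h>
#include <linux/media.h>
#include <linux/videodev2.h>

int decode_one_slice(int video_fd, int request_fd,
                     struct v4l2_ctrl_h264_decode_params *dec_params,
                     struct v4l2_buffer *bitstream_buf)
{
    /* 1. Attach the per-frame controls to the request. */
    struct v4l2_ext_control ctrl = {
        .id = V4L2_CID_STATELESS_H264_DECODE_PARAMS,
        .size = sizeof(*dec_params),
        .ptr = dec_params,
    };
    struct v4l2_ext_controls ctrls = {
        .which = V4L2_CTRL_WHICH_REQUEST_VAL,
        .request_fd = request_fd,
        .count = 1,
        .controls = &ctrl,
    };
    if (ioctl(video_fd, VIDIOC_S_EXT_CTRLS, &ctrls) < 0)
        return -1;

    /* 2. Queue the bitstream buffer on the OUTPUT queue, bound to the
     *    same request. */
    bitstream_buf->flags |= V4L2_BUF_FLAG_REQUEST_FD;
    bitstream_buf->request_fd = request_fd;
    if (ioctl(video_fd, VIDIOC_QBUF, bitstream_buf) < 0)
        return -1;

    /* 3. Queue the request itself, then wait for it to complete. */
    if (ioctl(request_fd, MEDIA_REQUEST_IOC_QUEUE) < 0)
        return -1;

    struct pollfd pfd = { .fd = request_fd, .events = POLLPRI };
    if (poll(&pfd, 1, -1) < 0)
        return -1;

    /* 4. Recycle the request for the next frame or slice. */
    return ioctl(request_fd, MEDIA_REQUEST_IOC_REINIT);
}
```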
Now, everything started again with Bootlin. Paul and Maxime from Bootlin came and created a Kickstarter with the goal of making a mainline driver for the Allwinner VPUs. What they ended up managing to do is to help finalize the request API, which got finalized and merged upstream. They also managed to get the first Cedrus driver in; they called the driver Cedrus, it's the Allwinner VPU, and it's named after the proprietary stack they were reverse engineering. It landed with MPEG-2 support, which was quite a good start, MPEG-2 is much simpler than H.264, and H.264 was being actively developed. The target goal was to implement a VA-API driver for these codecs as a stopgap: with a VA-API driver they would get user-space support much quicker, because VA-API was already supported by some browsers, by FFmpeg, by GStreamer and other systems.

In 2019, things started to speed up a little. My team, well, Collabora, not really me directly, but some folks at Collabora, Boris and Ezequiel, started to work on finalizing the Rockchip driver upstreaming, taking the opportunity of that momentum to actually get somewhere with it. And that's where the Rockchip driver finally went mainline. Cedrus gained H.264 but also HEVC support. And then the controls for the codecs were added, but as a staging API; I'm not sure the concept of a staging API really existed back then, but now it does. It's an API that only exists inside the kernel, and you have to copy the headers out of the kernel if you want to implement it. This was a way to get things upstream earlier so we could collaborate on all this and finalize the API. We're actually still working on the API, and there was a proposal last week to finalize the upstreaming.

Oh, and in 2019 the Rockchip driver actually got renamed, and I have a little story about that which is quite interesting. At the same time as we were working on this, on the Rockchip RK3288 and RK3399, sorry, Chris Healy from Zodiac, which is now Safran, was talking to me about the new in-flight entertainment system based on i.MX 8M devices that they were building, and he was saying that he was going to join, his team being Philipp Zabel at Pengutronix, the group of people working on stateless codecs, and that he knew the codec there was a Hantro G1 and a G2.
So I was a bit curious, and a bit later he sent me an evaluation kit so I could try it out and play with the i.MX 8M Quad, which was brand new. What's cool with NXP is that you can go on their website and, in exchange for your email address, which can be annoying, you actually get the documentation for the chip. Sometimes some bits of the documentation are missing, but the codec was not missing, it was there. I also had the documentation for the Rockchip, and I started comparing the registers side by side, and what I found out is that it was the same hardware, just a somewhat older version on the Rockchip. So about the same day I wrote to Philipp Zabel, and I think I told him, "maybe you should speak with Ezequiel and Boris, they might have already upstreamed the driver you're going to write". That's where it started, and i.MX 8M support actually arrived very quickly afterwards, as most of the work was already there.

If you're curious about what Hantro is: Hantro was, it's no longer, a Finnish company producing codecs. It got bought by On2, and On2 got bought by Google. While at Google, the Hantro folks produced the very well-known VP8 and VP9 royalty-free codecs, and they also made hardware designs for these codecs, which Google gives away for free to silicon vendors; I think they still do that today. Later, Google sold the Hantro team to VeriSilicon, and VeriSilicon is still working on these codecs, with new versions being published.

It was interesting, because when we discovered that it was a Hantro codec and not a Rockchip one, we renamed the staging driver; as it was staging, it was not a stable API yet. It reminded me of the story about stmmac. stmmac is an Ethernet driver that was thought to be an ST design, but later they found that it was actually a Synopsys DesignWare design, and not just any DesignWare design, it's the one used on most SBCs today for gigabit Ethernet. But still today, because of API stability, they had to keep the stmmac name even though it's not specific to ST chips. So that bit of history was avoided; nice catch, we were pretty happy.

Now, all this effort would not have been possible without the help of the community. At some point in 2019, perhaps a little earlier, I'm not too sure of the dates, LibreELEC and Kodi contributors came by and produced a port of FFmpeg that supports those stateless codecs. The port is not mainline yet, because the FFmpeg people would like the API to be final before they merge it into the trunk, but you can download the fork from Kwiboo's git repository if you ever need it; come by on Slack and I'll give you the link. They provided lots of bug fixes, they've been doing corner-case testing with slightly broken streams and everything, and they also provided interlaced content support, which matters for their use cases. Not all chips support interlaced content yet, but the API for it has been settled already.

Now, things didn't stop there. In 2020, RK3399 support was merged; this time we believe it's a proper Rockchip design, so it has its own driver. In GStreamer, something moved too: we started having base classes for codecs, basically a framework. FFmpeg has had that for many, many years, and that's probably why they have a richer set of stateless codec interfaces supported, but it was finally added to GStreamer in order to support DXVA and NVDEC. By the way, big thanks to Chromium, as most of the code we have there is based on theirs.
And I added GStreamer H.264 and VP8 decoding support using the V4L2 stateless interface, and we don't have FFmpeg's restriction, so that landed in gst-plugins-bad, "bad" being our staging area in GStreamer. I was supposed to present all of this at Embedded World this year, which got cancelled because of COVID, but nothing of that is lost: it's available to everyone and we're still working on it, still improving. I'm trying to finalize the API, I'm trying to help out, and I'm going to add more support to GStreamer later; I might also try to help with the upstreaming of the FFmpeg port. There are a lot of projects ahead of us. Meanwhile, the effort to produce a VA-API driver has stopped: as GStreamer got native support, Chromium got native support, and FFmpeg has working, ready-to-merge patches for upstream support, it was not really meaningful anymore, so it was abandoned.

Now, what does this enable? This is just a small subset of what this effort enables: it enables blob-free decoding on hardware. Notably, on the left you can see the MNT Reform laptop, which is an i.MX 8M laptop; in the middle, the Pine64, which is based on the Allwinner A64 and uses the Cedrus driver; and on the right, the Purism Librem 5 phone, which is also an i.MX 8M device. All these projects that I selected aim at having, not royalty-free, but actually blob-free OSes for their platform, in order to better support open source.

So now it's showtime, I have a live demo for you. It's going to be a screen share; it's a bit finicky to get the screen sharing going, so I might do a little back and forth, but let's hope it's going to work. I'm going to present you this little board. It's made by Libre Computer; I really like what they do, they basically invest their own money into making very cheap integrations of various SoCs. This SoC is the Allwinner H3. I'm going to demonstrate GStreamer running a stateless codec, with the latest kernel, latest GStreamer and everything, on this device. And because I could, and I only have one artifact of my Embedded World demo, I'm going to play back a video of my Embedded World demo on it, which I'll explain.

So let's try and set up the screen share now. Oh wait, I need to plug this in. As it's doing a TFTP boot, I'm going to plug in a little Ethernet cable, and to show you, I'm using a USB HDMI capture card over here, so I'm going to plug in the HDMI, and when the screen sharing is on I'm going to apply USB power in order to boot it up. Let's get this sharing working, it's a bit of a dance. Voilà, hope it works. Now let's plug this in so we can boot that board; let's hope it's going to work. Oh yeah, it boots, great. On this board I'm actually running a Fedora rootfs. Why Fedora? Because it's the same as I use on my PC, so I have the tools to manage the rootfs, I can do dnf for the platform from my main PC. Let's log in; if I turn on the little keyboard it's going to work better. There we go, super secret password. Now, as I said, I'm running quite a recent kernel, a 5.7-rc2, I'm not sure which day it was; I think I have one patch on top, which I've submitted to the mailing list, for a little bug in Cedrus that was affecting GStreamer but not FFmpeg.

Now I'm going to load the GStreamer uninstalled environment. GStreamer has a build tool called gst-build, because GStreamer is made of multiple repositories which are hard to manage; gst-build is an aggregator based on Meson that builds the entirety of GStreamer, and it has a facility that allows you to run the plugins in place. This little command loads a huge environment which fetches the plugin shared objects from where they are in the build tree, instead of from a single directory as you normally would. It takes a little while because there are a lot of files and we're over NFS. I don't have any graphical setup, so I'm going to use a little command-line player called gst-play. I actually need playbin3 for the negotiation to work properly with KMS, there are some limitations in the older playbin there, and the video will be displayed through kmssink. On this device, as on most newer devices, the video layer is actually an underlay, so we don't see it by default, but the Allwinner display engine is very capable, you can flip that around, so I change the Z position to 2 in order to bring it to the front. Now it's prerolling, which means it's preparing, buffering, in order to display the frames. There we go.
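For reference, here is a rough C equivalent of what the demo does with gst-play: a playbin3 pipeline rendered through kmssink, with the plane's Z position raised so the video underlay shows above the primary plane. The URI, the zpos value and the use of kmssink's plane-properties property are my assumptions based on the demo description, not the exact command line used.

```c
/* Sketch roughly equivalent to the gst-play invocation in the demo:
 * playbin3 + kmssink with the video plane brought to the front. */
#include <gst/gst.h>

int main(int argc, char **argv)
{
    gst_init(&argc, &argv);

    GstElement *playbin = gst_element_factory_make("playbin3", NULL);
    GstElement *sink = gst_element_factory_make("kmssink", NULL);
    if (!playbin || !sink)
        return 1;

    /* Raise the video plane above the primary plane (Allwinner underlay);
     * zpos=2 mirrors what the talk describes. */
    gst_util_set_object_arg(G_OBJECT(sink), "plane-properties", "s,zpos=2");
    g_object_set(playbin,
                 "uri", "file:///home/demo/embedded-world.mp4", /* assumed */
                 "video-sink", sink,
                 NULL);

    gst_element_set_state(playbin, GST_STATE_PLAYING);

    /* Run until end of stream or error. */
    GstBus *bus = gst_element_get_bus(playbin);
    GstMessage *msg = gst_bus_timed_pop_filtered(
        bus, GST_CLOCK_TIME_NONE, GST_MESSAGE_EOS | GST_MESSAGE_ERROR);
    if (msg)
        gst_message_unref(msg);

    gst_element_set_state(playbin, GST_STATE_NULL);
    gst_object_unref(bus);
    gst_object_unref(playbin);
    return 0;
}
```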
So let's pause there. On the left, that's the monitor of a streaming server that is being demoed there, and over three different protocols we're streaming to an i.MX 8M evaluation kit, which is on the second screen over here. This device is actually the Zodiac, well, Safran, next-generation in-flight entertainment device, which we showed in the Embedded World demo. We're demoing dragging around some underlays: there are two hardware underlays and one GL overlay being used there, which we can drag around and do some smooth stuff with. So voilà, that's it for the demo, let's stop this share.

So now it's question time. I think I have about five minutes of questions left, so I'll try to answer as many as I can, and let's see how it goes. I need to click on this and publish, so let's take the first one.

Yes, so: what exactly is an accelerator? What we use the term accelerator for is a piece of hardware that doesn't do the entire work; it only accelerates part of the decoding. In the case of a stateless decoder, you still need to go through the decoding process: you implement a third of the decoding specification yourself, and the accelerator, given the parameters, handles mostly the mathematical part of the processing. I don't know if "accelerator" is fully proper for this one; if you consider the ones that provide you with encryption algorithms, that's a better match for "accelerator", but yeah.

Next question: do we also get a performance benefit with stateless? Yes, but probably not in the way you think. When you use a stateless decoder, you have to do a lot more work on the main processor, so you wake up the main processor more often; in some ways that could be considered a performance degradation. But with a stateless decoder you have a tighter, finer control of the decoding, and you can reduce the latency introduced by those firmwares, which usually implement the full decoding model, which adds delays in order to achieve perfect timing at the output. So yes, you can gain some performance there. And if you have a chip like the Cedrus on the Allwinner, which is slice-based, you can enable slices, which is basically splitting the encoding into multiple regions, well, in H.264 it's a contiguous set of macroblocks, and you can decode them as they arrive over the network. So by the time you receive the last slice, the remaining latency is only the time it takes to decode that last slice; you actually win in latency. This is not something I've tested or implemented yet, but this is definitely a goal there, so you can gain in latency, especially for streaming.
A simple question there: will the slides be available online? I will be uploading them to Sched right after, so if you have Sched, just go there and you'll be able to get the slides in PDF form. This is a test, these are masks, this one, oh, another question, no, the same question again. Did I cover that question already? Yeah, I think so. Oh, there's a new question, oops, I need to publish it.

Yes, so the question is: could you share the spec of the device used in the demo? It's complicated. The device used in the demo is an Allwinner H3, so it doesn't have a public spec, I don't have a public spec for that one, it's a reverse-engineering effort. But I can post on the Slack channel the link to the wiki with the reverse-engineering documentation, maintained by the linux-sunxi group, S-U-N-X-I being the term used to refer to these chips. I'll try to put that on Slack after this presentation. And voilà, I think that's it; any other questions? The effort is not done, we're still actively working on this. Thanks for attending, I'll be on Slack to answer any more questions or to send you links.

Oh, actually no, that's the same one. Oh yeah, there's another question, let's take it, I have time. The question is: do you use less power with a stateless decoder? The answer is no, because some of the decoding process is done on your CPU, so unless your main CPU is not power hungry at all, to the point where it beats a coprocessor, it's going to be hungrier on power.

Then this one is a GStreamer-specific question: can v4l2h264dec output the video without a copy to GPU memory for color conversion? The first thing in that question that I need to correct is the naming: v4l2h264dec is the decoder for the stateful interface, while we use a different name for the stateless decoder, v4l2slh264dec, in order to differentiate them. But the answer is yes, it's already able to do zero-copy to the GPU. It's actually what is being demoed, both in the demo I showed you live, where I'm displaying to KMS, and in the demo inside the demo, where I'm displaying the zero-copy buffers through the etnaviv driver. Now, there's a limitation with etnaviv: it doesn't support NV12 very well, so we actually use the post-processor on the Hantro VPU to convert to YUY2. So if your codec has a post-processor that can produce another memory layout or another format, it can help, and you can display that directly. We're also looking forward to adding modifier and compression support there, but for that we need support for modifiers in V4L2.

OK, another question: how well does it support decoding multiple streams in parallel, are there priorities in V4L2 stateless? Basically, the question is how we parallelize the decoding on a stateless decoder. The stateless decoder is stateless hardware-wise, so each instance of the V4L2 driver has a state, but the hardware can be multiplexed because it's fully stateless: you just ask for a frame. And the way it's multiplexed is actually very simple: there's a lock on the usage of one decoder instance (if there are multiple decoders there will be multiple locks, but that's not supported yet), and the Linux scheduler decides who has the priority to take hold of that resource. So it's scheduled by the same scheduler that schedules your processes; if you have a higher-priority process, it will win the time share on the decoder. That's the current implementation; this is of course something that could evolve and improve. And the overhead of scheduling is about none, because it just decodes one frame at a time, so you switch from frame to frame and it doesn't matter which stream the frame comes from.
Another one, do I still have... okay, let's publish that one, apparently I have more time. Is it still a good idea to transform a stateful codec into a stateless one? It's a bit of a question for purists, because if power efficiency is your goal and you don't have any problem with the firmware you're using in terms of multiplexing streams, you don't gain anything by going stateless. But if the firmware is not doing a great job at multiplexing, and the power efficiency difference is very small so it's not your main interest, you can gain from going stateless: first you remove the dependency on a proprietary blob, and probably remove some of the bugs that you cannot fix because they're in the blob, and you gain in flexibility, you can definitely gain in flexibility. Now, is it always possible? It depends on the architecture. I'll give you an example: the Raspberry Pi has a new HEVC decoder, and this HEVC decoder is exposed as stateless; there's no driver yet, but we know that it's stateless and there's reference code that has been published. Most decoders in the Raspberry Pi are likely similar, so they could have been exposed to the main CPU, but most of them are exposed through another processor, and some of the decoders are not accessible from the main CPU at all, you have to speak to the coprocessor. Converting those to stateless would mean writing a new firmware that exposes them as stateless, and that's quite a lot of overhead in the process of going stateless. But it would resemble a lot what GPUs do, because those accelerators sit on the GPU and you speak to them with a stateless bitstream, a stateless protocol really, through your GPU card. So yes, there are possible gains there.

I think this is going to be the final question: is the history mentioned here mostly applicable to ARM, or is it true for x86 as well? All the x86 codecs that have a public implementation with source code are exposed through GPUs today, AMD, NVIDIA and Intel, so there is nothing in this V4L2 stateless interface that currently applies to an x86 platform. Though the availability of this interface could allow some vendors, I'm thinking of Matrox as an example, people I've worked with, to actually implement a proper driver for their PCI cards, which offer an array of decoding accelerators; basically they didn't have a kernel API for that before. Now vendors need to decide whether they offer a unified API across all platforms, which is usually where they end up with a proprietary solution, or whether they really want to support mainline Linux, so it's really a vendor decision here.

So that's it for the questions, that's it for this talk. I hope this was useful and enhanced your knowledge of codecs. There's another question; I'll take it over Slack right after. Thanks for watching.