Greetings everyone, welcome to this Embedded Linux Conference Europe session. Today I'll be talking about supporting hardware-accelerated video encoding with mainline Linux. This is especially about stateless encoders, which are not yet supported in mainline. We're going to see that with the example of the Hantro H1, which I'll give many more details about throughout this talk.

But first, a few details about me. I'm Paul, I work at Bootlin as an embedded Linux engineer, where I do expertise and development work. I also give a training course about graphics display and rendering. I try to work with open source, as in contributing as many of my changes upstream as possible. I've worked on the Cedrus VPU driver for the decoding part in V4L2, where I worked on MPEG-2 and H.265. I also did some work on the sun4i DRM driver, which supports Allwinner SoCs, just like the Cedrus driver. More recently I also worked on MIPI CSI-2, which is a fairly recent interface for cameras. I live in Toulouse, in the southwest of France, where one of our offices is located.

So let's begin with H.264 encoding. First of all, why do we need to encode video? The fact is that pictures are pretty big and take quite some space in memory. If we make a quick calculation for a 10-minute example at 1920x1080, with a 32 bits per pixel frame buffer and 30 frames per second, that comes out to roughly 140 gigabytes just to store that 10-minute video. So we definitely want to compress that and reduce this size, which is the core point of video encoding. The idea is that we're going to use lots of different methods to compress the data and reduce the size, but keep in mind that doing this requires some heavy calculation, so there is always some overhead to encoding and decoding. And there is of course the main concern of encoding, which is keeping good visual quality for a given size. There's a trade-off here that we want to keep under control, so that the result suits the use case that we have.

There are very different ways to encode videos, lots of different formats: these are called video codecs, and I've given just a few examples here. H.264 is one of those video codecs. Coded videos contain both compressed video data, which is the actual data used to reconstruct the frames, and some metadata that explains how the frame was encoded and therefore how it should be decoded. This is essentially encoder/decoder pipeline configuration metadata, which is definitely necessary because there are lots of different ways to encode H.264 and lots of different features that can be used or not: lots of conditionals that all need to be laid out in the metadata. The coded data itself is called the bitstream, and this contains all of the compressed video data. The bitstream is put into a bigger container, which holds this video data and might also hold some audio or some subtitles, for example. That is what makes an actual video file that you can use on your computer.

H.264 is known by many different names. It was standardized by ITU-T as H.264, but it was also standardized by ISO as MPEG-4 AVC, a.k.a. MPEG-4 Part 10. MPEG-4 itself can actually mean lots of different things, which is why we're going to talk about H.264 or AVC, just to make things clear.
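To double-check that raw-storage figure, here is a minimal standalone C sketch of the calculation (the exact result comes out around 139 GiB, or roughly 149 GB, depending on the unit):

    /* Quick sanity check of the raw-video storage figure from above:
     * 1920x1080, 32 bits per pixel, 30 frames per second, 10 minutes. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t width = 1920, height = 1080;
        uint64_t bytes_per_pixel = 4;           /* 32 bits per pixel */
        uint64_t fps = 30, seconds = 10 * 60;   /* 10 minutes */

        uint64_t frame_bytes = width * height * bytes_per_pixel;
        uint64_t total_bytes = frame_bytes * fps * seconds;

        printf("frame size: %.1f MiB\n", frame_bytes / (1024.0 * 1024.0));
        printf("total size: %.1f GiB (%.1f GB)\n",
               total_bytes / (1024.0 * 1024.0 * 1024.0), total_bytes / 1e9);
        return 0;
    }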
H.264 is really used a lot nowadays, on the web especially, but for streaming in general, or just any video transmission that benefits from compression; it's really everywhere now. It's used for TV with interlaced content, where you only update half of the lines at a time, but what we're mostly going to look at today is progressive mode, where pictures are full frames and not just fields of interlaced content.

H.264 has lots of compression features, some of which can be complex to implement or require a lot of logic. There are different profiles of H.264 that each enable a subset of features. For example the baseline profile is quite simple and is a good fit for simple implementations, while the high profile requires a lot more logic and has more features and more flexibility to encode videos efficiently. There's also a concept of levels, which set the maximum bit rate for the video and also constrain the dimensions. And there are some annex specifications as well. For example SVC, for scalability, where you add more temporal or spatial resolution or better quality: basically you add extra data that you can decide to decode or not. MVC is another extension to carry multiple views in the video; this is mostly used for stereoscopy, which is basically 3D, where you have to show a different view to each eye. Those are the main extensions that are used.

H.264 has specific semantics and, basically, specific units, which are called network abstraction layer units (NALUs). Each one has a header with a specific type that indicates what the data of the unit is. For example, there is metadata in NALUs such as the sequence parameter set (SPS), which is metadata for the whole sequence, as well as the picture parameter set (PPS) for each picture. That's for the metadata, and we can also find the actual coded video data, which is called the slice data. It's important to keep in mind that the metadata is just a series of bits that needs to be understood according to the syntax of H.264, which is really bit-aligned and often conditional. There is a specific format called Annex B, which puts a start-code prefix before the beginning of each NALU so that it's easier to find the start of the next one.

One important idea in H.264 is macroblocks, which are subdivisions of the picture into blocks of 16x16 pixels. These are grouped as slices, so the coded information in the coded slice data NALU is composed of data about macroblocks. And there are different types of slices, which we're going to talk about just next.

Okay, so let's look at some compression techniques that are used in H.264, beginning with chroma sub-sampling. This is the idea that we're going to dedicate fewer bits to color information than to luminosity information, just because the human eye is less sensitive to a difference in color than to a difference in luminosity. This is done using an appropriate color space, for example YUV with Rec. 709: we convert a regular RGB image in the sRGB color space to this Rec. 709 YUV color space, which uses a YUV color model that splits the components between luminosity and two chrominance channels. And we're going to sub-sample the chrominance channels: for example with 4:2:0 we sub-sample horizontally and vertically by two, so we divide by four the number of bits per pixel on each of the chroma planes or components.
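As a quick illustration of that 4:2:0 arithmetic, here is a small C sketch comparing the per-frame size of a 4:2:0 YUV buffer with a 24 bits per pixel RGB one, using a 1920x1080 frame as an example:

    /* Rough illustration of 4:2:0 chroma sub-sampling: one full-resolution
     * luma plane plus two chroma planes sub-sampled by two in each direction. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t w = 1920, h = 1080;

        uint64_t luma = w * h;                   /* 8 bits per pixel for Y */
        uint64_t chroma = 2 * (w / 2) * (h / 2); /* Cb + Cr, each W/2 x H/2 */
        uint64_t yuv420 = luma + chroma;
        uint64_t rgb24 = w * h * 3;              /* 24 bpp packed RGB */

        printf("YUV 4:2:0 frame: %llu bytes (%.1f bits/pixel)\n",
               (unsigned long long)yuv420, 8.0 * yuv420 / (w * h));
        printf("RGB 24bpp frame: %llu bytes\n", (unsigned long long)rgb24);
        printf("ratio: %.2f\n", (double)yuv420 / rgb24);
        return 0;
    }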
So in the end this gives us 12 bits per pixel, which divides the size by two if we start from a 24 bits per pixel sRGB input. This is great because it's not really very noticeable to the eye, so it works quite well.

Another compression technique that is at the core of H.264 encoding is quantization. The idea of quantization is that each macroblock is converted into the frequency domain using a discrete cosine transform operation, which is just a frequency-based representation of the picture in 2D. Then we apply a quantization step, which is just a number that we use to divide each of the frequency-space coefficients before rounding them and keeping only those quantized values. The idea is that the bigger the quantization step, the more quality we lose: we lose information on the frequency-space coefficients. The quantization step itself is indexed by a quantization parameter (QP), which we can use to control the quality of the encoding.

Another compression technique in H.264 is intra coding, which basically exploits redundancy within a picture, like a wall that has a constant color without much change. We use prediction patterns, for example with directions, to deduce the values of a block's pixels from the neighboring pixels that were already coded. This is just one way to reduce the amount of information that we need to transmit.

Another important compression technique in H.264 uses the fact that in videos, subsequent pictures are often mostly the same, in the sense that only a small part of the picture will change or something will move, but not the whole scene at once, and the background often remains the same. So instead of coding the full information of each picture every time, we're just going to code the difference between pictures. In order to do this, motion vectors are calculated to also indicate the movement of parts of previous pictures. At decode time, these previous pictures are used as references to reconstruct the next picture from the difference between the two. This is called inter-picture prediction. That's less information to carry, so this is good for compression. Motion vectors are calculated between pictures and indicate how a part of the image moves or changes in the following frames. In order to do this, the references need to be kept decoded in memory. Here is a visualization of motion vectors using ffplay from ffmpeg: it basically shows how the elements of the picture have moved from the previous frames, as reconstructed with just the motion vectors.

More specifically, there are two types of inter-frame prediction in H.264. The first one only uses previous pictures and is used in P slices; it can use up to 16 previous pictures as references. And there is bi-directional prediction, which is used in B slices; this one can use previous pictures but also following pictures, which need to be placed before the B frame in bitstream order. In this case the display order will be different from the bitstream order, because the B frame is decoded after the frame it depends on but displayed before it: for example, for the display order I0 B1 B2 P3, the bitstream order is I0 P3 B1 B2. In order to start all this, an intra-coded picture needs to be present first. This kind of sequence is called a group of pictures (GOP), where we have an intra picture at the start and then inter-predicted frames such as P frames or B frames.
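Before moving on, here is a toy sketch of the quantization step described a moment ago. It is only meant to illustrate the divide-and-round idea and the effect of the step size; the real H.264 pipeline uses an integer transform with per-coefficient scaling, and the coefficient values below are made up:

    /* Toy quantization of a 4x4 block of frequency-domain coefficients. */
    #include <stdio.h>
    #include <math.h>

    #define N 4

    static void quantize(const double in[N][N], int out[N][N], double step)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                out[i][j] = (int)lround(in[i][j] / step);
    }

    int main(void)
    {
        /* Made-up coefficients: a large DC value, smaller high-frequency ones. */
        double coeffs[N][N] = {
            { 620.0, -30.5, 12.2, -3.1 },
            { -25.7,  18.4, -6.3,  1.0 },
            {  10.9,  -5.2,  2.1, -0.4 },
            {  -2.8,   1.3, -0.6,  0.2 },
        };
        int q[N][N];

        for (double step = 4.0; step <= 64.0; step *= 4.0) {
            int nonzero = 0;

            quantize(coeffs, q, step);
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    nonzero += (q[i][j] != 0);
            /* Bigger step: fewer non-zero coefficients, smaller bitstream,
             * but more information lost when de-quantizing. */
            printf("step %5.1f: %d non-zero coefficients\n", step, nonzero);
        }
        return 0;
    }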
Keep in mind that using bi-directional inter-prediction will introduce some latency, because you have to reorder the bitstream to have the B frame after the frame it depends on. So you will basically create some extra latency when decoding.

The last type of compression technique that we're going to see is entropy coding, which is a very common form of compression that assigns shorter symbols to frequent occurrences in the input data. This is a kind of lossless compression that works pretty well for video. There are basically two main types of entropy coding used for the quantized discrete cosine transform coefficients. The basic one, used by default, is context-adaptive variable-length coding (CAVLC). But there's also a more advanced one, enabled at least in the high profile, which is called context-adaptive binary arithmetic coding (CABAC). The first one is a kind of Huffman coding, while the second one is arithmetic coding, which is more complex but provides better results.

One of the key aspects in H.264 encoding is called rate control. This is basically controlling the trade-off between the quality and the bitstream size. There are lots of things to take into account when doing this, and it really depends on the use case and what we want to do. There are different rate control modes that apply different policies, which all end up controlling the quantization parameter (QP), which basically determines the quality of the output. In the first mode, CQP, we just have a constant QP for every frame, so we have constant quality and that's pretty simple. In CRF we instead have a constant quality factor, which is not the QP directly but a parameter used to find the most appropriate QP: for example I frames will get a slightly lower QP to have more quality, and the QP will be a bit higher in predicted frames because it's less important that they have good quality. Then there is CBR, where we try to keep a constant bit rate across a group of pictures, and ABR, which is the same idea but over the whole sequence, so over the whole video file, which can be hard if you don't analyze the sequence first. ABR is a good fit, for example, for storing movies in a given size, while CBR is good for streaming where you want a constant bit rate. CRF is a good way to keep a constant quality, for example to archive a video or something like that, and CQP is basically not a very good fit in most cases: it's usually best to use CRF instead.

In order to evaluate the quality we can use the PSNR metric, the peak signal-to-noise ratio, which just gives us a number that indicates how much quality the signal has kept when we compare it to a reference.

So now I'm going to show you a video where we use CRF mode and we just increase the CRF, so we worsen the quality as time goes on. We're going to look at the QP, at the signal-to-noise ratio and at the size of each encoded frame. Starting from about 24 I can notice the difference, and then the quality worsens significantly with each increase of the CRF. We can see the size getting lower and lower, but the quality also gets really bad.

So now we're going to look at the Hantro H1 H.264 encoder, so we're going to look at the hardware a bit closer. The Hantro H1 is a fairly common hardware H.264 encoder; it can also do VP8 and JPEG.
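For reference, PSNR is simple to compute; here is a minimal sketch for 8-bit luma samples, using the usual definition based on the mean squared error (the sample values below are made up):

    /* Minimal PSNR computation between two 8-bit luma buffers:
     * PSNR = 10 * log10(MAX^2 / MSE), with MAX = 255. */
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    static double psnr(const uint8_t *ref, const uint8_t *enc, size_t count)
    {
        double mse = 0.0;

        for (size_t i = 0; i < count; i++) {
            double diff = (double)ref[i] - (double)enc[i];
            mse += diff * diff;
        }
        mse /= (double)count;

        if (mse == 0.0)
            return INFINITY;    /* identical pictures */

        return 10.0 * log10((255.0 * 255.0) / mse);
    }

    int main(void)
    {
        /* Tiny made-up example: a "reference" and a slightly distorted copy. */
        uint8_t ref[8] = { 16, 32, 64, 96, 128, 160, 192, 224 };
        uint8_t enc[8] = { 17, 30, 66, 95, 130, 158, 193, 222 };

        printf("PSNR: %.2f dB\n", psnr(ref, enc, 8));
        return 0;
    }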
It is found in a few embedded ARM SoCs. Rockchip actually uses the Hantro H1 a lot, on many of its SoCs, but it's also found on one of NXP's SoCs, the i.MX8MM with the double M, because the single-M i.MX8M doesn't have this encoder. Depending on the version it can do 1080p at 30 or 60 frames per second, with lots of H.264 profiles including high and main, and it can also do MVC stereo, so it can encode 3D, but I haven't used it for that.

Looking at a quick block diagram from the i.MX8MM manual, we can see the different blocks that are used for the encoding, which roughly correspond to the different steps of the encoding techniques that I've described before. The H1 is a stateless hardware implementation, meaning that it has no microcontroller and no firmware running: it's just hardware, just logic. As we could see in the diagram, it has a preprocessor that can do lots of things, including cropping, rotation, scaling, color space conversion and also image stabilization, which I haven't tried yet either.

It will spit out slice NALUs only, so no metadata: the metadata needs to be recreated. And there are some constraints on specific parameters; these are just examples that I found. These values in the SPS and PPS need to be set to exactly that, otherwise the slice NALU isn't going to decode correctly and it's not going to work.

It doesn't do B slices, so this is a good fit for camera-based recording where you don't want the extra latency of making a B frame; it's easier to just do GOPs with IPPPPP, then IPPPPP, and that's how it goes, so no support for B slices there. It takes references for P slices, which need to be stored in specific reconstruction buffers: you cannot use the input buffers, you have to use the decoded version of the encoded stream, because that's what's used for inter-frame prediction. You cannot use the input, so you also have to allocate specific reconstruction buffers to keep the decoded references around. If you want to use CABAC, that's supported, but you also have to provide lots of values in tables in a specific DMA buffer.

The H1 has some internal rate control mechanisms which are more advanced than just controlling the QP. The QP itself is set to a base value, and you can also set a min and a max that act as clipping points throughout the operation of the internal rate control mechanisms. They are not actually used anymore, but I'm still going to describe them, just because it took me quite some time to get my head around them. The first one is called MAD, for mean absolute difference. It's basically about setting a threshold on the difference between the pixels of the input and the output: you set a threshold value, and if that value is reached at some point during the encoding process, it just applies a delta to the QP that you can configure. The more advanced mechanism is the checkpoints mechanism. The idea is that you configure checkpoints at regular intervals of macroblocks, so you might want to split them evenly across the number of macroblocks, with up to 10 checkpoints. At each checkpoint it calculates the number of non-zero quantized coefficients that it has produced so far, and you give it a target for this number of non-zero coefficients.
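The error handling based on these targets is described just below. As a rough sketch of how the checkpoints and their cumulative targets might be laid out for a 1080p frame (the per-frame coefficient budget is made up, and this is not the actual H1 register programming):

    /* Rough sketch: up to 10 checkpoints spread evenly across the macroblocks
     * of a 1080p frame, each with a cumulative non-zero coefficient target. */
    #include <stdio.h>

    #define NUM_CHECKPOINTS 10

    int main(void)
    {
        unsigned int mb_width = 1920 / 16;          /* 120 macroblock columns */
        unsigned int mb_height = 1080 / 16 + 1;     /* rounded up to 68 rows */
        unsigned int mb_count = mb_width * mb_height;
        unsigned int frame_target_coeffs = 200000;  /* made-up per-frame budget */

        for (unsigned int i = 1; i <= NUM_CHECKPOINTS; i++) {
            unsigned int mb_pos = mb_count * i / NUM_CHECKPOINTS;
            unsigned int target = frame_target_coeffs * i / NUM_CHECKPOINTS;

            printf("checkpoint %2u: macroblock %5u, cumulative target %6u\n",
                   i, mb_pos, target);
        }
        return 0;
    }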
Then it will calculate the difference between the target and the count, which gives it an error, and you can basically configure different deltas of QP depending on the error. So this allows you to dynamically increase or decrease quality depending on the number of coefficients that you already have. In my experience this helped with the encoding process, especially for constant bit rate. And the way that it works is that there is some feedback data that you can use to basically turn this into a control loop when tuning these different thresholds, errors and levels.

Quickly, the feedback data of the H1 is the following. You have the sum of the QP of each macroblock, which, given the number of macroblocks, you can use to get some kind of average QP, which can be useful. You have the total number of non-zero quantized coefficients, so that's the count at the final checkpoint at the end of the process. Then you get the value at each checkpoint, which you can use to set a future target in the checkpoint method. There's also the number of macroblocks under the MAD threshold.

Okay, so now let's take a look at the integration of this hardware encoder with V4L2. I'm going to start with stateful encoding, which doesn't concern our hardware, but it's what was already implemented to handle H.264 encoding. There's a specific pixel format for the H.264 bitstream. With stateful encoding the hardware produces both the slice data and the metadata, but in our case we only have the slice NALUs, so no metadata. There are a few drivers that currently implement this using the V4L2 mem2mem framework. They use some generic V4L2 controls to configure the encoder, for example the profile and the level, but also some specific features like the entropy mode, and rate control settings like the bitrate mode or the actual bitrate. And some drivers also have driver-specific controls, like this one for the MFC driver. Usually in these stateful implementations there is a microcontroller with a firmware that does all the state and reference management as well as the whole rate control implementation.

So this isn't such a good fit for our encoder, because there are lots of features needed for stateless support that are currently not in the V4L2 API. On the Hantro H1, lots of parameters can be set to control the encoding and which features to use, but some of those also need to be set to specific values for the bitstream to be valid, so we can't just take any kind of PPS or SPS or slice header. In the stateless case, the state is tracked a bit by V4L2 and a lot by user space, so we're using the media request API, just like in stateless decoding, to tie the buffers and the parameters, which are also V4L2 controls. We need to take care of the reconstruction buffers, which need to be attached to the input buffers, and we also need to take care of the references that we need to use. Rate control in the stateless case can basically be implemented by the V4L2 driver, which will read the feedback data itself and apply some algorithm, or it can be left to user space by providing the feedback data, as V4L2 controls for example, and letting user space deal with that feedback data.

There is already some existing H1 reference code out there, for example in Chromium OS, where it's used with a downstream kernel driver that is pretty much stateless.
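For comparison with what follows, this is roughly what configuring a stateful V4L2 encoder through those generic controls looks like from user space. It is a minimal sketch with no error handling; /dev/video0 and the chosen values are just assumptions, and the Hantro H1 cannot be driven this way since it only produces slice data:

    /* Minimal sketch: configuring a stateful V4L2 H.264 encoder with the
     * generic codec controls (profile, level, entropy mode, rate control). */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/videodev2.h>

    int main(void)
    {
        int fd = open("/dev/video0", O_RDWR);   /* hypothetical encoder node */
        struct v4l2_ext_control ctrls[5];
        struct v4l2_ext_controls arg;

        memset(ctrls, 0, sizeof(ctrls));
        ctrls[0].id = V4L2_CID_MPEG_VIDEO_H264_PROFILE;
        ctrls[0].value = V4L2_MPEG_VIDEO_H264_PROFILE_MAIN;
        ctrls[1].id = V4L2_CID_MPEG_VIDEO_H264_LEVEL;
        ctrls[1].value = V4L2_MPEG_VIDEO_H264_LEVEL_4_0;
        ctrls[2].id = V4L2_CID_MPEG_VIDEO_H264_ENTROPY_MODE;
        ctrls[2].value = V4L2_MPEG_VIDEO_H264_ENTROPY_MODE_CABAC;
        ctrls[3].id = V4L2_CID_MPEG_VIDEO_BITRATE_MODE;
        ctrls[3].value = V4L2_MPEG_VIDEO_BITRATE_MODE_CBR;
        ctrls[4].id = V4L2_CID_MPEG_VIDEO_BITRATE;
        ctrls[4].value = 4000000;   /* 4 Mbit/s, arbitrary */

        memset(&arg, 0, sizeof(arg));
        arg.ctrl_class = V4L2_CTRL_CLASS_MPEG;
        arg.count = 5;
        arg.controls = ctrls;

        ioctl(fd, VIDIOC_S_EXT_CTRLS, &arg);

        /* Buffers would then be set up on the OUTPUT (raw) and CAPTURE
         * (H.264 bitstream) queues and streamed with the mem2mem model. */
        return 0;
    }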
There's also a user-space implementation, for the rate control part especially, which is available as a libv4l2 plugin. Then on the other hand there is Rockchip, which has its own implementation for its own downstream kernel, which is called MPP, and it also supports the H1, which is used a lot by Rockchip.

So the approach that was taken to support this encoder was to use the mainline Hantro driver, which is in staging and supports decoding with the Hantro G1, which is kind of the decoder counterpart to the H1; the two are often found together. I worked on adding H.264 encoding to that driver, basically by looking at the Chromium OS and MPP implementations and following the same logic. I somewhat simplified the rate control algorithm, but it doesn't give such bad results, so I'm not too unhappy about it. It was done using the media request API, so really just the same as what the Hantro driver was already using for the decoding part. The rate control, which was constant bitrate, was implemented fully in user space, so the feedback data was provided through controls and retrieved from user space after each encoder run. If you want to look it up, it's on github.com/bootlin in our Linux tree: there's a Hantro H.264 encoding branch with the kernel patches, and I've also pushed the user-space side, which is mostly the rate control implementation and a generic test tool for the Hantro encoder.

This is what the user-space API looks like. We have the encode parameters, which take fields from the NALUs, so we have the slice parameters and the PPS parameters, as well as a single timestamp for the reference. Then on the right side we can see the rate control encode parameters, with the QP and the different thresholds, deltas and so on, and we also have the encode feedback, which is what we get from the registers of the H1 encoder.

So this works, but the structures are not very generic: they only list parameters and fields that are quite specific to the Hantro. Instead of that, a more generic V4L2 control-based approach would be to reuse the existing stateful controls, add extra ones for the things that could be missing or for features that can be configured and currently don't have controls, but also add specific controls to handle references. Then for rate control there are two major approaches which could work. The first one would be to do it in user space, which would require generic controls, but this is quite possible because we don't actually have to use the internal Hantro-specific rate control mechanisms: just providing the QP and being able to control the slice type or the size of the GOP and things like this could be enough. We could also provide some generic feedback data like the number of non-zero coefficients or the QP sum; these look like things that can be generic and that can also be obtained on other hardware. The downside is that it might encourage proprietary implementations, for example vendors that would only contribute the driver without any user-space implementation for rate control and then do their own proprietary magic, which wouldn't work in the free software world. Doing it in the kernel instead would probably be easier for user space, but would also have some downsides, like not giving user space fine control over what's going on, and the internal implementations might be a bit rough; but it could probably reuse the existing V4L2 controls for rate control.
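To give an idea of what the user-space approach could look like, here is a very simplified constant-bitrate control loop that adjusts the QP from per-frame feedback. This is a toy sketch under the assumption that the driver reports the encoded size of each frame; it is not the actual rate-control code from the branch mentioned above:

    /* Toy constant-bitrate control loop: after each encoded frame, compare the
     * produced size against the per-frame budget and nudge the QP accordingly. */
    #include <stdio.h>

    #define QP_MIN 10
    #define QP_MAX 51

    static int qp = 30;     /* current base QP */

    static void rate_control_update(unsigned int frame_bytes,
                                    unsigned int target_bytes)
    {
        /* Frame came out too big: increase QP (lower quality, fewer bits). */
        if (frame_bytes > target_bytes + target_bytes / 10)
            qp++;
        /* Frame came out too small: decrease QP (better quality, more bits). */
        else if (frame_bytes + target_bytes / 10 < target_bytes)
            qp--;

        if (qp < QP_MIN)
            qp = QP_MIN;
        if (qp > QP_MAX)
            qp = QP_MAX;
    }

    int main(void)
    {
        /* 4 Mbit/s at 30 fps gives a per-frame budget of about 16.6 kB. */
        unsigned int target_bytes = 4000000 / 8 / 30;
        /* Made-up sizes of successive encoded frames, as they could be
         * reported back after each encode run. */
        unsigned int frame_sizes[] = { 60000, 42000, 30000, 22000, 17000, 15000 };

        for (unsigned int i = 0; i < sizeof(frame_sizes) / sizeof(frame_sizes[0]); i++) {
            rate_control_update(frame_sizes[i], target_bytes);
            printf("frame %u: %u bytes (budget %u), next QP %d\n",
                   i, frame_sizes[i], target_bytes, qp);
        }
        return 0;
    }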
So this is just some feedback about a first implementation that has some issues and that cannot be accepted as is, but it gives some ideas about what I think would work for a proper interface for stateless encoders. This is a discussion that I'm definitely interested in having, so that we can eventually clean up that Hantro encoder code and bring it upstream with a nice interface, which may also suit other hardware like the Allwinner encoder, which is also stateless. If you have more information about hardware that also falls in the stateless category, if you have details about what its register interface looks like, or some ideas about what a good API to support it would be, that's quite welcome.

Okay, so that's it for me. Thank you for attending this talk, it's been a great pleasure. Now if you have questions or just remarks or things that you would like to discuss, I'm available, so feel free. Thanks again, and otherwise see you next time and have a great day.