Okay, welcome. First of all, I'm Hans Verkuil. I'm one of the co-maintainers of Video4Linux, and I work for Cisco. I thought it was time to give a status update; every so often I do that, just to show what is going on in Video4Linux land, in that subsystem. There are two main topics today. The biggest one is hardware codec support, because that has seen a huge amount of work in the past year, year and a half. What was particularly interesting about that project was how many different people, developers and organizations were involved: Bootlin, Collabora, Google, Pengutronix, Outreachy, and the Linux Kernel Mentorship Program that is going on now. I'm sure I missed people, but it made it really interesting to see all these different companies working together to make a good API for a certain class of hardware codecs.

First, a little overview of how a hardware codec looks. One thing I should mention at the start: unless I state otherwise, I'm talking about decoders. So compressed frames go in, and decoded frames come out. It gets really awkward if I have to explain every time how it works for an encoder, but for an encoder it's just swapped: a raw frame goes in, and compressed frames come out. So I'm not going to mention that anymore unless I really need to. For a decoder, user space prepares buffers containing the compressed byte stream and sends them in. The hardware codec takes a buffer, decodes it, and the results go out on the other side, in buffers containing the decoded frames. At the end those are dequeued, so user space has access to the uncompressed frames.

One thing that is important for codecs, and I will go through this very quickly because this is not a course in how codecs work, but this bit is important, which is why I made a slide for it, or actually stole a slide for it; there's a nice Wikipedia reference on it. You have I-frames: those are basically JPEGs, look at it like that, so they compress the image independently of any other frame. Then for video you also have predicted pictures, P-frames: they are basically the diff between an I-frame and the picture that you want to create. And you have bidirectional frames, B-frames: they take both the P-frame and the I-frame and basically build up the picture from those. I leave it as an exercise to the reader to understand why, when you do video conferencing, you never have B-frames. Think about it, and you will figure it out. But this is the type of stuff that a codec needs to support.

So we have two different classes of hardware codecs. The first is called stateful codecs. Basically you give it the byte stream; the codec will parse the byte stream, extract all the metadata, and start decoding. All the state involved in this is kept in the hardware or the firmware, it doesn't matter which for us: it's a black box and it's all in there. Ideally, with a stateful codec you can just give it the byte stream without any regard to frame boundaries. There are some that can do that, but most of them require you to at least parse the byte stream in user space so you get the frame boundaries. There is a capability flag that lets user space know whether or not the codec can parse a completely raw byte stream. This has been supported for a long time.
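Just to make that data flow concrete, here is a minimal sketch of the queue/dequeue step for a stateful decoder, using the standard V4L2 streaming ioctls. It assumes a single-planar decoder node whose buffers have already been set up, whose capture buffers are already queued, and which is already streaming (real codec drivers are often multi-planar, which needs a planes array on top of this); the function and parameter names are just for illustration.

```c
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

/* Sketch: feed one compressed frame to a stateful decoder and fetch one result.
 * 'fd' is the decoder video node; both queues are assumed to be streaming and
 * the capture buffers are assumed to have been queued already. */
static int decode_one(int fd, unsigned int out_index, unsigned int bytesused)
{
        struct v4l2_buffer buf;

        /* OUTPUT queue: user space -> hardware, carries the compressed byte stream. */
        memset(&buf, 0, sizeof(buf));
        buf.type = V4L2_BUF_TYPE_VIDEO_OUTPUT;
        buf.memory = V4L2_MEMORY_MMAP;
        buf.index = out_index;
        buf.bytesused = bytesused;      /* size of the compressed data in this buffer */
        if (ioctl(fd, VIDIOC_QBUF, &buf))
                return -1;

        /* CAPTURE queue: hardware -> user space, returns a decoded frame. */
        memset(&buf, 0, sizeof(buf));
        buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        buf.memory = V4L2_MEMORY_MMAP;
        return ioctl(fd, VIDIOC_DQBUF, &buf);
}
```

The point is just the split described above: compressed data goes in on the output queue, decoded frames come back on the capture queue, and all the codec state stays inside the hardware or firmware.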
The main thing that has been added is that we really wrote down all the rules of the API: how to do things like seeks, what happens when you have dynamic resolution changes in the middle of the stream, and all these corner cases. They have been clearly written down, and Google has contributed much of that work. The detailed decoder specification is now in mainline, or will be with 5.5. For the encoder there are still a few details that we want to figure out, but hopefully that will go in soon. Basically this works. Several of the existing drivers are being updated to nicely follow the specification, so that there are no differences between drivers.

We also really like to test our APIs, because they are complex; video is always complex. So being able to test without actual hardware is very nice. A virtual codec driver, vicodec, was written, and it allows us to do prototyping. It supports stateful codecs, and it actually has its own codec, developed by someone as a university project, guaranteed IP-free, by the way. We are using the v4l2-compliance utility to test this, and that's working very well; we're using it in daily regression tests.

The other class of codecs, and that's the main topic, are stateless codecs. Basically these come from hardware designers who were too lazy: they just do the decompression part, and everything else, all the parsing and extracting of the metadata, is farmed out to user space. These have been around for several years. The thing is that you need an API that can send both the compressed frame and all the metadata to the hardware, and for that (I'm one slide too early already) we required a new API, and I will come to that on the next slide.

For stateless decoders we now have several in the kernel. We don't have stateless encoders yet, but they will appear. The detailed specification, and again we really needed to specify very clearly how you do seeks and dynamic resolution changes and all these things, has been merged for stateless decoders and will appear in 5.5. Again Google has done a lot of work on that; it's very, very nice. As part of the Outreachy project, Dafna Hirschfeld (I saw her in the audience, there she is) did the work of adding stateless decoder support to the virtual codec driver. And the v4l2-ctl utility supports it. We don't have compliance tests for that yet, unfortunately. There is also a preliminary version of stateless encoder support, but that's not merged yet because we haven't agreed on the API for it.

Since we need to do per-frame configuration, this requires a new framework, which is called the Request API. We currently have two drivers merged: one is Cedrus for the Allwinner SoCs, and one is Hantro, IP that is used in Rockchip and i.MX8. I believe it's actually used in some other SoCs as well; I'm not entirely sure, but I heard that. Currently supported codecs: MPEG-2, H.264, HEVC (merged last week, so this is brand spanking new), and VP8. I know VP9 is in the works, but I haven't seen patches yet. All of this is in staging. I want to keep it in staging until we also have stateless encoders, so we're sure that we didn't miss anything when we wrote the APIs. I also want it to mature a little bit, so we're sure we didn't forget anything that you actually need in order to decode complex H.264 streams, for example. But I'm hopeful that it will be moved to mainline, let's say, in the first half of next year.
I'm a bit careful about that. The Request API was basically the biggest holdout; it was the main reason why it took so long to be able to do stateless codec support. You need to supply both state information and a compressed frame to the hardware. A long time ago, when Chrome OS started to use stateless decoders, they had exactly the same issue, and at the time I wrote an API called the Configuration Store API that used the control framework. The control framework was originally designed to set up controls like brightness and contrast, but it has evolved quite a bit and is much more powerful now. The Configuration Store API used it to associate controls with a frame, so you could actually do this per-frame configuration. It turns out that Chrome OS has been using that for quite a few years, but it was blocked for mainline inclusion because it simply was too specific to that one use case, and the intention was to make it more generic. After several attempts (and frankly, that's a presentation in itself, lots of things went wrong) it took a really long time before we finally had a Request API that was generic enough. It will still need some more work for the more advanced use cases, but for codec support it's now perfectly fine.

So the basic idea, and this is not new, we stole it from Android, is that you have a request object. You associate your buffer with it, you associate the metadata with it, and then you queue it to the driver. So you have this sort of opaque object kept internally with all the information, and the driver then interprets it and programs the hardware. As I said, currently it is only used by codecs, but we want to use it for complex camera pipelines as well. The API will need changes or improvements for that, because the internal framework is not powerful enough to do everything that we want to do there. There's no ETA; I have no idea when that will happen or how, but that's the intention.

For stateless codecs, the state information is all passed as controls, except that when you set a control, it is actually set in the request object and not directly in the hardware. Basically all the metadata that is specified by the codec standards is made available as controls. Effectively each control is a struct that you store in memory, and since this is all standardized, all the information in these structs is completely specific to a codec; there are no hardware specifics in there. You will actually see in the headers references to the sections of the standard that they implement. This allows different hardware implementing the same codec to still use the same API, because it's really completely defined by the codec standard and not by the hardware. It has to be like that, because the decoding part of a codec is fixed; otherwise different hardware would decode a stream differently, and that's not what you want. The encoding part gives a lot more leeway to hardware implementers to do fancy stuff, as long as the result can be decoded in a standard way. But for the decoder we do not really expect to see hardware-specific controls; it should all be specific to a codec standard. That's very important for the whole API, otherwise it would be a nightmare to implement.

So we have two devices. One is a video node, the traditional one, through which you queue and dequeue the buffers. The other is a media device.
And that's the one that creates the request object. You could see it as a global device that is not specific to an output or a capture queue; it's just a device that allows you to create a request object, use that to set values, and then queue it, committing it to the driver.

One consequence of stateless codecs is that user space has to parse the byte stream. For a stateful codec you can just give it the byte stream; here you need to parse it, extract all the metadata, and extract the compressed frame, and that has to be done in user space. It's a really, really bad idea to try to do this in kernel space, because if there's one thing you get when parsing byte streams, it's buffer overflows, and you don't want those in the kernel. In addition, parsing a byte stream is actually hard, particularly if you do something like video conferencing, where you may have packet loss. What do you do if information is missing? How do you fill that in? That is a big selling point of different companies and products: how well do they handle that? So you don't want that in the kernel. You want to customize it, you want to apply your own fancy magic sauce to make this work. That's why this all has to be in user space. Actually, this is why a lot of people working on decoder software really prefer stateless codecs: they can do the parsing themselves and do all the magic stuff to fill in missing pieces, instead of leaving it to hardware where you don't know what it's doing.

So this is the code that you use. It's just a single ioctl on the media device; there you allocate request objects. Typically, if you need to allocate 10 buffers, you will create one request object per buffer.

This is a little bit of a reminder of what is now happening. A stateful codec just keeps track of all the I-frames and P-frames internally; you don't need to do anything, it all happens magically. For a stateless codec you actually need to provide that information. A P-frame depends on an I-frame, but you need to tell it which one it is. So you also need to keep the buffer around: the I-frame is decoded into a buffer, and you need to give a reference to that buffer when you decode the P-frame. What you also need to do, since B-frames depend on both an I-frame and a P-frame, is queue the buffers in a different order: first the I-frame, then the P-frame, and then the B-frames, because when you decode a B-frame it needs references to the decoded I-frame and P-frame. So you're actually reordering how the buffers are processed.

So how do you do this? You have an output buffer. By the way, the point of view in Video4Linux is user space: when you send something to the hardware, that's an output buffer, it goes out, and when you receive something back from the hardware, that's a capture buffer. You just have to know this; it's a point of view. These days we would probably call it sink or source or whatever. So first of all, you see that you set a request fd. That means that this buffer will not be queued directly to the hardware; it will be queued to a request object. You also set a timestamp. Actually, for codecs it's not a timestamp, it's really interpreted as a tag, a unique identifier of that buffer, and it's what will be used to identify that buffer so that B-frames and P-frames can refer to it.
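What that slide boils down to is roughly the sketch below: one ioctl on the media device to allocate a request, then the output buffer queued into that request with its timestamp used as a tag. This is a hedged illustration, not the literal slide code; the function and parameter names are made up here, and the buffer setup is assumed to have been done already.

```c
#include <string.h>
#include <sys/ioctl.h>
#include <linux/media.h>
#include <linux/videodev2.h>

/* Sketch: allocate a request and queue one compressed frame into it.
 * 'media_fd' is the media device, 'video_fd' the decoder video node,
 * 'index' the output buffer index and 'tag' the identifier for this frame. */
static int queue_compressed_frame(int media_fd, int video_fd,
                                  unsigned int index, unsigned int bytesused,
                                  unsigned int tag, int *req_fd)
{
        struct v4l2_buffer buf;

        /* A single ioctl on the media device hands back a new request as a file descriptor. */
        if (ioctl(media_fd, MEDIA_IOC_REQUEST_ALLOC, req_fd))
                return -1;

        memset(&buf, 0, sizeof(buf));
        buf.type = V4L2_BUF_TYPE_VIDEO_OUTPUT;
        buf.memory = V4L2_MEMORY_MMAP;
        buf.index = index;
        buf.bytesused = bytesused;             /* size of the compressed frame/slice */
        buf.flags = V4L2_BUF_FLAG_REQUEST_FD;  /* queue to the request, not straight to the hardware */
        buf.request_fd = *req_fd;
        buf.timestamp.tv_sec = tag;            /* the "timestamp" is really a tag identifying this frame */
        buf.timestamp.tv_usec = 0;

        return ioctl(video_fd, VIDIOC_QBUF, &buf);
}
```

In the flow described in the talk you would do something like this once per compressed frame, and the tag is what P- and B-frames later use to refer back to it.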
Then there is a magic function that takes a timeval and returns it as nanoseconds, because all the references are 64-bit numbers. Unfortunately, the current API uses a timeval structure, so this is a uniform way of converting the timestamp to a 64-bit number that you can then later use as the reference. It's awkward, and we hope to fix it in the future, but for now this is the way it has to work.

You don't need to know this in detail, but the important bits are these structures, the controls for the MPEG-2 decoder. You can see that this is a sequence header; it has a reference to the standard, where you find all the details and the meaning of the fields. So this is basically just a reflection of the standard. Same for the picture header, again with a complete reference to where this is defined. And here is the actual control, the slice params, and this one has two reference timestamps, as u64 values. You would fill these in: for a B-frame you fill in both, for a P-frame you fill in one, and for an I-frame they are ignored, because it doesn't depend on anything. Then there is the control ID with which you pass in this structure, which contains the sequence and picture structures. So a control really just contains the metadata associated with the codec. And this is a requirement, so you cannot change it: if you make it hardware-specific, then it's no longer a generic API, and that defeats the whole purpose.

These control definitions are currently not public; they are internal kernel headers, because they will change. MPEG-2 and VP8 seem pretty much OK, but for H.264 and the very new HEVC we still see some changes, and little nitty-gritty details that we forgot are still being defined.

So an application will parse the headers from the byte stream, fill this structure in, and then set a control. It would do it like this. This is the structure itself for the control, and you fill in the parameters. In this case it is a P-frame, so you also fill in the backward reference that you got when you queued the I-frame. When you queued the I-frame, as you saw, I set a timestamp that defines the reference tag for that buffer, and you use that here to refer to that I-frame. Then some boilerplate code: again you set the request file descriptor, which means that this control isn't set directly in the hardware, it goes to the request object first, and then you actually set the control.

The rest is relatively easy. You queue the capture buffer that will receive the result, and you queue the request object. Once the request object is queued, the driver has the output buffer associated with the request, it has the metadata associated with the request, and it has a capture buffer that it can send the result to, so now it can do its decoding step. When it's done, the request object signals that: it notifies user space saying, hey, I'm done. And if the driver returns additional information (that's not the case here, but it might happen) it can give you result information; for example, if you had a sensor in a more complicated camera pipeline, it could return statistics. You could extract that from the request object before destroying it. In this case you don't need that; you just wait until the request is done, because then you know the frame is decoded. Final step: you dequeue both the output and the capture buffer, and you're done with it.
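Put together, the steps just described look roughly like the sketch below: convert the tag to the 64-bit reference value, set the codec metadata control into the request, queue the capture buffer, queue the request, and wait for it to complete. The MPEG-2 struct and control ID are the staging-era, kernel-internal names the talk refers to (you would need the kernel's staging MPEG-2 control header for them), so treat them as placeholders that may have changed since; the request-related ioctls are the stable part, and the helper names are made up for illustration.

```c
#include <poll.h>
#include <sys/ioctl.h>
#include <linux/media.h>
#include <linux/videodev2.h>

/* The same timeval-to-nanoseconds conversion the kernel applies to the tag
 * (tv_sec = tag, tv_usec = 0, as queued above). */
static inline __u64 tag_to_ns(unsigned int tag)
{
        return (__u64)tag * 1000000000ULL;
}

/* Sketch: attach the parsed MPEG-2 metadata to the request and run it.
 * 'slice_params' was filled in from the parsed byte stream; for a P-frame its
 * backward reference field would hold tag_to_ns() of the I-frame's tag.
 * The struct and control ID come from the kernel's staging MPEG-2 header
 * and are placeholders here. */
static int run_request(int video_fd, int req_fd,
                       struct v4l2_ctrl_mpeg2_slice_params *slice_params,
                       unsigned int cap_index)
{
        struct v4l2_ext_control ctrl = {
                .id = V4L2_CID_MPEG_VIDEO_MPEG2_SLICE_PARAMS,  /* staging-era name */
                .size = sizeof(*slice_params),
                .ptr = slice_params,
        };
        struct v4l2_ext_controls ctrls = {
                .which = V4L2_CTRL_WHICH_REQUEST_VAL,  /* store in the request, not in the hardware */
                .request_fd = req_fd,
                .count = 1,
                .controls = &ctrl,
        };
        struct v4l2_buffer cap = {
                .type = V4L2_BUF_TYPE_VIDEO_CAPTURE,
                .memory = V4L2_MEMORY_MMAP,
                .index = cap_index,
        };
        struct pollfd pfd = { .fd = req_fd, .events = POLLPRI };

        if (ioctl(video_fd, VIDIOC_S_EXT_CTRLS, &ctrls))
                return -1;
        if (ioctl(video_fd, VIDIOC_QBUF, &cap))      /* buffer that will receive the decoded frame */
                return -1;
        if (ioctl(req_fd, MEDIA_REQUEST_IOC_QUEUE))  /* commit the whole request to the driver */
                return -1;
        if (poll(&pfd, 1, -1) <= 0)                  /* request fd signals when decoding is done */
                return -1;
        /* Final step: VIDIOC_DQBUF the output and capture buffers as usual. */
        return 0;
}
```

For an I-frame you would leave the reference fields alone; for a B-frame you would fill in both, using the tags of the decoded I- and P-frame buffers.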
And when you dequeue it, the capture buffer timestamp will be the same as the output buffer timestamp, because it's copied over: when you decode an output buffer, the timestamp is copied to the capture buffer. That's done because you need to refer to that capture buffer, since it contains the decoded I-frame, and when you queue a P-frame it needs to refer to that buffer. The administration is all done internally in the kernel, so you don't have to worry about that, but you do have to really understand how these tag values are copied from an output buffer to a capture buffer, so that you can refer to it.

So how would that work? You queue an I-frame to be decoded and give it tag 1, so tv_sec in the timestamp is set to 1. Then you do a P-frame: you give it another tag, 2, and you set the backward reference to 1, because that refers to the decoded buffer of the I-frame. Then you do the B-frames. You will never refer to a B-frame, at least for this codec, so you don't need to set a tag there, but you do need to set the backward and forward references to the I-frame and the P-frame. And you keep doing that; that's the way this works. The nice thing about this is that if you know the whole sequence, you can actually queue buffers for decoding in advance. You don't have to wait every time until one frame is decoded before you send the next one, as long as you keep track of all the tags.

Very recently, last week, we added one missing bit: slice support. With H.264 and HEVC you can actually deal with slices instead of full frames, basically horizontal strips, which improves latency. You queue the slices, and you want to collect all the decoded slices in the same capture buffer. The result of that is that you cannot return this capture buffer after one slice is decoded; you have to wait until they are all done before you can return it to user space. So basically you need to hold on to that capture buffer, which is very annoying, until it's completely filled. So you set a special buffer flag, basically saying: hold on to the capture buffer, don't return it yet. When the system detects that a new frame starts, it will automatically return the capture buffer. But there is a corner case: when you stop streaming, you may end up with one capture buffer still being held. So in that case you need to flush it, to say: OK, I'm completely done, return it. This is new support, just added, and it's really the final missing bit for the H.264 and HEVC decoders that we needed.
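As a hedged sketch of that new mechanism, this is roughly what the hold flag and the final flush look like, assuming the buffer flag and decoder command that were added for this purpose (V4L2_BUF_FLAG_M2M_HOLD_CAPTURE_BUF and V4L2_DEC_CMD_FLUSH) and the same request-based queueing as in the earlier sketches; the function names are again just for illustration.

```c
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

/* Sketch: queue one slice of a frame. All slices of the same frame carry the
 * same tag; the hold flag tells the driver to keep collecting decoded slices
 * in the same capture buffer instead of returning it after each slice. */
static int queue_slice(int video_fd, int req_fd, unsigned int index,
                       unsigned int bytesused, unsigned int frame_tag)
{
        struct v4l2_buffer buf;

        memset(&buf, 0, sizeof(buf));
        buf.type = V4L2_BUF_TYPE_VIDEO_OUTPUT;
        buf.memory = V4L2_MEMORY_MMAP;
        buf.index = index;
        buf.bytesused = bytesused;
        buf.flags = V4L2_BUF_FLAG_REQUEST_FD | V4L2_BUF_FLAG_M2M_HOLD_CAPTURE_BUF;
        buf.request_fd = req_fd;
        buf.timestamp.tv_sec = frame_tag;  /* a new tag marks the start of a new frame,
                                            * at which point the held buffer is returned */
        buf.timestamp.tv_usec = 0;

        return ioctl(video_fd, VIDIOC_QBUF, &buf);
}

/* Corner case: at the end of the stream one capture buffer may still be held,
 * so flush it explicitly to get it back. */
static int flush_held_capture_buffer(int video_fd)
{
        struct v4l2_decoder_cmd cmd;

        memset(&cmd, 0, sizeof(cmd));
        cmd.cmd = V4L2_DEC_CMD_FLUSH;
        return ioctl(video_fd, VIDIOC_DECODER_CMD, &cmd);
}
```

The flush is only needed for that stop-streaming corner case; during normal streaming the driver returns the held capture buffer on its own as soon as it sees a slice with a new tag.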
So that's about half an hour of talking about codecs, and we really did a lot of work in that area. But the other area where we did a lot of work is testing, and especially the virtual drivers that we have, which emulate hardware. That has seen a lot of work, a lot of improvements. The main one is vivid, which emulates webcams, HDMI receivers, and the vertical blanking interface. Very recently metadata support was added, so histogram information, for example, can be returned as metadata. That was done by Vandana as part of the Linux Kernel Mentorship Program, so kudos to her. She's still working on it, so hopefully we will also get support for touch pads, which basically return an image of the pressure points, and for a software-defined radio transmitter; that's one that's been missing for a long time. Hopefully she can finish it before the mentorship ends, but she's already done great work getting the metadata support in. And by doing this we actually find corner cases in our API that we didn't think about before.

vim2m is a memory-to-memory processing device, like a deinterlacer or a converter. vimc emulates a complex camera pipeline; it's seeing a lot of work, also as part of Helen's work at the University of São Paulo, with volunteers who are looking into this driver to improve it. And of course there is vicodec, the virtual codec driver. It currently implements a stateful encoder, a stateful decoder, and a stateless decoder. Patches for a stateless encoder are there, but as I said, the API isn't finalized, so we're holding off on that.

We do a lot of testing with these; they are part of our own daily regression testing. There is a test-media script that I use a lot to test all these virtual drivers and do all sorts of nasty things, like unloading them while you are streaming, just to make sure that you didn't introduce regressions into the kernel. I believe it's now also used in kernelci.org; I really need to talk to them to see what the current status of that is. We are continually improving the compliance tests, and I could really use more support there, some more people working on those, because it's getting quite complicated. What I would really like to have is a parser for H.264 and similar codecs that just takes the byte stream, parses it, puts the metadata in controls, and extracts the compressed frames, so we could easily integrate it into the compliance tests.

There has also been a lot of work on the documentation, as I said already: the whole new API and the stateless decoder spec are merged, so that's very good work. Writing documentation is amazingly useful for finding corner cases. You're writing it, and then you're thinking: OK, but what if I want to stop streaming and there haven't been any buffers queued, what happens in that case? I see it as a trio: you have the virtual drivers, you have the documentation, and you have the compliance tests, and you develop them all together. That really helps in making a good-quality API, because you see all your errors very early on, because you really have to think about it. syzbot and syzkaller have been finding way too many issues; they're now also using the virtual drivers for testing, so that's very useful.

Resources. First of all, the specification itself, you can find it there. If you want to go into the documentation for the codecs, you use that link. The proposed stateful encoder specification is not yet in mainline, but you can see it there as well. You have the upstream Git repository of our stuff, and there's a test utility for the Cedrus decoder that can be useful. All the utilities are found in the v4l-utils Git repository. And the last one I want to emphasize: we're keeping a list of open projects for volunteers to look at, and most of them relate to these virtual drivers, improving them and getting new features in. If you're interested, contact us. There is so much to do in this area, especially in testing and improving these virtual drivers and making them more realistic. I'd love to get some more people involved, and it's a great learning experience, so I very much recommend it if you have time to look at these.

That's it for me. Questions? Oh, one more thing I wanted to mention: a quick peek into the future.
One of the big problems that we have at the moment, if you are working with Video4Linux, is the distinction between single-planar and multi-planar formats. Multi-planar means that chroma and luma are in different areas of memory. And the whole streaming I/O API is getting old, and it's getting complex to use. So we're looking into creating new ioctls that do this much more efficiently and smarter. It's very early stages yet; we have some RFC proposals, but I think after the codecs this will be the next big development that we are actively looking into, because it's getting too complicated to program in user space. We want to make that a much smoother experience for people. Also, better cooperation with DRM, so that we are a bit more aligned with what they are doing. So that's a little preview. Any questions? Everybody asleep? Someone is awake.

Is there any difference in performance between stateful and stateless decoders? I think it's next to impossible to say. I don't have any performance results, but I suspect that most stateful encoders and decoders actually do this, the parsing and everything, in firmware, so it completely depends on how that is implemented. From what I understand from people who are actively working on decoders, they actually prefer the stateless version, because then they decide what to do when packets are missing and how to preprocess and parse the stream. That is where they want to make a difference, so they're actually very interested in doing that themselves.

Yeah, so the question, the remark really, is that a lot of these stateful codecs actually do this parsing themselves, as I said. So ideally it would be great if we could just bypass the parsing step and use such a codec as a stateless codec, so the data goes straight in. Am I right? OK. So basically, even if you have a stateful decoder and it has the parsing inside the firmware, you still have parsing in user space if you use GStreamer or FFmpeg, because they use this parsed data for other things. So if you use a stateless decoder, you don't do the redundant parsing in the firmware; you just have the one in user space. You actually save some time, right? Yes. Oh, OK, there's probably a better person than me to answer this. So that's actually correct: some of the parsing is redundant, and could be redundant multiple times. Though there are some bits that we have to parse to drive a stateless codec that we don't parse in GStreamer or FFmpeg just in order to decode to hardware.

Other questions? With a stateless decoder, is it possible to zero-copy the data in, or do you always need a copy on that path? No, it should be zero-copy; that's the intention. Is that because there is a pointer? Well, you need to parse the byte stream that you receive, of course, and you need to copy that into the buffer, so you always have to do that. That's what I mean: do you also need to parse the frame data in the byte stream, or just the headers? I'm not entirely sure I'm the best person for that; Nicolas knows a heck of a lot more about it. It's a broad question, and it's per codec. As an example, in H.264 you have to parse up to the slice headers, but the rest you don't have to parse, for most of the hardware we know right now. That's why it's still unstable; we're trying to get more hardware. So the frame data itself is passed as a pointer and you don't have to copy it into a structure? That's exact. But for full zero-copy support, this is per driver.
There are some limitations with some types of memory allocation, and so on.

Hi. Could you give an overview of the current state of GStreamer and FFmpeg support for the stateless decoders? Does that work today? It's probably Nicolas who knows that best; I do the kernel stuff, right? I just provide the infrastructure so that people like Nicolas can make great things. I haven't worked on it, so don't give me any credit. The entire support has been tested with a patch set on FFmpeg, but it's not upstream yet because there's kind of a chicken-and-egg issue: we're changing the API, and we're changing FFmpeg. On the GStreamer side, we have a kind of by-accident working prototype on top of this API. We'll see what the future brings, basically.

And I just want to remind you that the API is still in staging, right? It will change a little bit. There's a lot of work being done here; also for LibreELEC and Kodi, people are looking into this. They really want this, but it is a little bit too early yet to make this a public API. We're still finding little corner cases, little things that we didn't realize need to be in the metadata as well. So the problems are not with the basic API; they are with the controls containing the metadata, which have to be complete. We would like to be fairly confident that we didn't miss anything there before making it public. We're not going to be unreasonable; this is not something that will stay in staging for three years. But let's say the first half of next year, as I said, I really hope we can get the first ones out of staging and into mainline. I forgot to mention there's also support in Chromium, and there's a port to the new API, because they have had support for this for a few years already.

Are we out of time? I think we need to stop here. Thank you very much.