OK, welcome. This is Michael Tretter, I am Philipp Zabel, and we are kernel and graphics developers at Pengutronix. Today we want to talk about zero-copy video streaming on embedded systems, the easy way. Zero-copy means we want to avoid unnecessary copies. Video streaming, in this case, means capturing video and encoding it, streaming it over a network connection, decoding it on the other side, maybe doing some processing, and displaying it. And the easy way we will see in the course of this talk.

Just to set the stage: when we think about embedded systems, we think about something like presentation and streaming devices that capture a presentation, maybe plugged in between your laptop and the projector. Something like augmented reality, where you have a somewhat mobile device with a low-latency path between the camera and the display, which possibly also captures and encodes a video stream for spectators or later analysis. A UAV video downlink, for FPV flight or just recording video with UAVs at a more leisurely pace. And things like intercoms, which are basically the same devices, just without rotors and bolted onto a wall.

The agenda of this talk: first, we will talk a little bit about video and graphics on embedded devices in general, about the hardware acceleration units that are present on such devices, and how and why to use zero-copy buffer sharing. Then we have a case study on the i.MX6, which is a device we use a lot at work, so it's a good candidate for us. We will explain what we mean by the easy way, and we will talk about open issues, the parts that are not yet as easy as we would like, but will be in the future.

OK, and with this, I hand over the microphone. So, Philipp already talked about the use cases for embedded systems and video streaming. What are the basic building blocks you need to implement for these? For example, for the video telephony part, you need recording and streaming. On the other side, you have to receive and present your imagery, maybe composite it with a nice graphical user interface. If you have cameras, for example with fisheye lenses, you have to do lens correction on your system. That's the checkerboard you usually see: if you see someone jumping around with a checkerboard in front of a camera, they are probably doing lens correction. And you maybe have to warp your image: as you can see here, this view is projected onto a round thing (I forgot the name), so you can look around with your phone and view it, so it still has to be projected onto that somehow. And maybe you even want to transcode, that is, take one video stream, decode it, and encode it in another format.

We are at the Embedded Linux Conference, so you probably all know about embedded systems, but I will still repeat the requirements. Usually they are portable, therefore they have to be energy efficient, because they are running on batteries and you want the battery to last as long as possible. And they should be lightweight: if they fly around, or if you carry them around, you don't want to haul heavy stuff. But on the other side, you are doing video and audio, so you have soft real-time requirements; you have to deliver your pictures and your sound in time. And the data rate might be "high", in quotation marks, because high depends on what use cases you are usually working with.
So you have a conflict between limited processing power on the one hand and the video and audio use cases on the other hand, which require quite a lot of processing power. Therefore, on embedded systems, we usually have specialized co-processors for that: graphics processing units, which do all the 3D stuff; video encoders and decoders that can encode and decode video in hardware; maybe even FPGAs that implement your use case in a dedicated hardware block. You have the camera that captures your imagery, a display controller for outputting it, and maybe a network controller for streaming it over some kind of network.

All these co-processors usually use some special format for how they store images in memory. For example, here we have a tiled format, which is used primarily by GPUs because of cache locality. Or you have YUV data, which stores the different color channels in different planes, that is, in different memory areas. Or even a Bayer format, which is used primarily by cameras: because our eye is more receptive to green than to the other colors, you want to have more of that. And if you want your hardware blocks to talk to each other, you somehow have to convert between these formats. So your camera will deliver Bayer format, you want to transform it with the GPU into a tiled format, and then encode it in your video format. And that needs copying and conversion. But we don't want to do that; we want zero copy.

Wikipedia describes zero copy as computer operations in which the CPU does not perform the task of copying. So we don't want to do the copy on the CPU, because it is expensive: the CPU memory bandwidth is usually smaller than that of the hardware accelerators, and you have to deal with cache management if you do the copying on the CPU, which can also be very slow in some cases. How slow? We will see right now.

OK, so our example system is an i.MX6, which is, I think, a design that was released in 2011, something like that. It has up to four ARM Cortex-A9 CPU cores at around one gigahertz; there are some variants that are clocked a little higher, some a bit lower. And in video use cases, those systems are usually bandwidth constrained. The memory interface of those systems is up to 533 MHz DDR3 SDRAM at 64 bits wide. That is 1066 megatransfers per second, meaning roughly eight gigabytes per second can be moved on the DRAM link. Of course there is overhead, opening and closing pages and wait times, so we don't expect to get eight gigabytes per second of memory bandwidth. Realistically, what we see when using the hardware units on an i.MX6 Quad of this generation is just something on the order of 2.5 gigabytes per second. On the later i.MX6 QuadPlus this is quite a bit more, I think on the order of four to maybe even five gigabytes per second, but I am not sure, and that is not our example system here.

So we are bandwidth constrained already on the interconnect, on the memory bus side. But if you look at the CPU cores, what you realistically get from just a memcpy, copying stuff around, is on the order of 500 megabytes per second. This is not a rigorous measurement where I can give you the exact numbers and conditions, but it is roughly what I see. And if there are hardware units using the memory bus at the same time, this number may go down to something like 300 megabytes per second.
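As a quick back-of-the-envelope check of those figures (an estimate, not a measurement):

```latex
\begin{aligned}
533\,\mathrm{MHz} \times 2\ (\mathrm{DDR}) &= 1066\,\mathrm{MT/s}, &
1066\,\mathrm{MT/s} \times 8\,\mathrm{B} &\approx 8.5\,\mathrm{GB/s}\ \text{(theoretical peak)}\\
1920 \times 1080 \times 1.5\,\mathrm{B} &\approx 3.1\,\mathrm{MB\ per\ 4{:}2{:}0\ frame}, &
3.1\,\mathrm{MB} \times 30\,\mathrm{fps} &\approx 93\,\mathrm{MB/s}\ \text{(pixel data alone)}
\end{aligned}
```

Every additional copy of such a stream costs roughly that much twice on the memory bus, once to read and once to write, before any strides, padding, or extra accesses are counted, and on top of whatever the hardware units are doing anyway.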
And just for comparison, the bandwidth necessary to store one 1080p 30 FPS YUV video with 4:2:0 chroma subsampling, that is, half the width and half the height in the chroma planes, is already about 120 megabytes per second. So we don't want to spend our limited CPU time and the CPU's memory bandwidth copying stuff like this around. And there is also cache management overhead: whenever we want to access this memory cached, which is usual if you have to reorder data somehow, we have to flush the caches, and invalidate them when we get new data from the hardware units. So we don't want to cross this border back and forth at all, if possible.

In the case of the i.MX6, the hardware units we are looking at besides the CPU are the following. The GPU, which can composite graphics together and can warp very well; it can do some simple computations, but as soon as the computations get more complex, it gets really slow, because it is a small embedded GPU. The display interface, which outputs the result of our rendering on the receiving side of our streaming setup. The capture interface, which is connected to a camera, or maybe to an HDMI-to-CSI converter that takes HDMI input from some external device and pipes it into the SoC; that is the sending side. And then we have the video processing unit, the VPU, which in this case is basically a small DSP that controls a few fixed-function hardware blocks in various combinations to encode and decode, H.264 in our case. Those units are connected over AXI and AHB bus interconnects to external memory, and as a detail, the video processing unit also uses internal SRAM for small local calculations, for motion vector lookup and things like that, which access a very small area of the picture many times.

Those hardware devices are driven by different drivers. For the GPU, we have the etnaviv project, which is comprised of the etnaviv kernel driver, a DRM driver doing the memory management and handling the command buffers of the hardware, and it provides a DRM interface to user space, where the Mesa etnaviv driver does all the things that are necessary to implement the graphics APIs you actually use when you program this thing. In this case that is OpenGL; I think we are at OpenGL ES 2 at the moment. For the display path, we have the imx-drm driver, which is also a DRM/KMS driver: the DRM side of the API is for buffer management, the KMS side is for mode setting and page flipping. For the capture device, we have imx-media, which is a Video4Linux driver that is in staging at the moment. And for the VPU, there is the CODA Video4Linux driver, which can encode raw streams into H.264, or decode them; there are a few other formats, I think MPEG-2 and MPEG-4 are supported at the moment, and some others are not yet.

The way to pass buffers between those different drivers, which on Linux are partly in different subsystems, is the DMA-BUF framework, where hardware buffers are referenced by a DMA-BUF object that can be passed around and attached to by other hardware devices. And we will see how this works. So I have drawn a little diagram, and I have to explain what is visible here, because if I had written everything down, the diagram would have been a bit overloaded. On the left side, there is our capture setup. The kernel driver keeps a queue of buffers, in this case the two blue blocks, which the capture device writes into in a round-robin fashion.
Then there is the Video4Linux API, which manages dequeuing and queuing buffers, which means the ownership of these buffers passes up to user space or back down into the kernel; whenever a buffer is owned by the kernel, the hardware may write into it. So the usual way to get data out of this is to mmap the buffers into user space, and then an application can get at the data, take the frames one after another, and do something with them. Without a method to pass a buffer directly into another driver, we would have to read those memory-mapped buffers from user space and copy their contents into the buffers on the input side of the next device that should process them further.

So in our example setup, a capture device whose stream gets encoded and sent somewhere, the second device would be the encoder, which has two queues because it is a memory-to-memory device. On its input side, which in Video4Linux speak is called the output side, because traditionally those devices were output-only, it has a queue of, in this case, two buffers which can be mmapped and written into, and then the device writes, again in round-robin fashion, the compressed results into the buffers on its other side, which in Video4Linux is again called the capture side.

With DMA-BUF, what happens is that the buffers are not exported to user space by mapping the whole buffer into the application's address space; instead, there is a special ioctl that exports a file descriptor referencing the DMA-BUF that describes this memory buffer. This file descriptor can be duplicated, can be passed to other processes, and from there, or from the same process, it can be passed via another ioctl back into the kernel, into a completely different driver. What happens in the kernel is that the file descriptor is resolved into the DMA-BUF, and then an attachment object is created that attaches to this DMA-BUF and keeps the original buffer alive. So if user space decides to tear down the queue on the capture device, the DMA-BUF will still be around as long as the encoder device, for example, is using it.

So in this case, there is no read on the CPU: the user space process never sees the memory that backs the buffers on the capture side, and there are no real buffers backed by memory on the input side of the encoder device. It just gets the handles and then reads directly from the buffers that the capture device wrote into before. This works the same way on the other side, where the decoder device gets its buffers, writes the decompressed frames into them, and those file descriptors are passed to user space and from there into the next driver, in this case the display device, if we are just decoding and displaying directly.

The way this works in practice is that for Video4Linux there are two ioctls directly involved. One is VIDIOC_EXPBUF, which produces this DMA-BUF file descriptor for a given Video4Linux buffer. The other one is VIDIOC_QBUF, which is part of the normal queue/dequeue mechanism: for memory-mapped buffers it just moves the ownership of a buffer around, but for DMABUF-type queues you pass in the file descriptor, and it is used from there on. For DRM we have the same, and this is where this originated in the first place: the internal handles of those DRM devices can be converted into a file descriptor, and a file descriptor can be converted back into such a handle, either in another DRM device or in a Video4Linux device.
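To make the Video4Linux side concrete, here is a minimal sketch of just those two ioctls: exporting a DMA-BUF from a capture queue and queueing it on the output queue of a memory-to-memory encoder. The device paths and the single-buffer setup are placeholders for illustration, and format negotiation, streaming, and error recovery are left out.

```c
/* Minimal sketch: export a DMA-BUF from a V4L2 capture queue and queue it
 * on the output (input) side of a V4L2 memory-to-memory encoder.
 * Device paths and buffer counts are illustrative only. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

int main(void)
{
	/* Placeholder device nodes; in reality these come from udev or probing. */
	int cap_fd = open("/dev/video-csi", O_RDWR);
	int enc_fd = open("/dev/video-encoder", O_RDWR);
	if (cap_fd < 0 || enc_fd < 0) {
		perror("open");
		return 1;
	}

	/* 1. Ask the capture driver to allocate one MMAP buffer ... */
	struct v4l2_requestbuffers req = {
		.count  = 1,
		.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE,
		.memory = V4L2_MEMORY_MMAP,
	};
	if (ioctl(cap_fd, VIDIOC_REQBUFS, &req) < 0) {
		perror("VIDIOC_REQBUFS (capture)");
		return 1;
	}

	/* 2. ... and export it as a DMA-BUF file descriptor. */
	struct v4l2_exportbuffer exp = {
		.type  = V4L2_BUF_TYPE_VIDEO_CAPTURE,
		.index = 0,
		.flags = O_CLOEXEC | O_RDWR,
	};
	if (ioctl(cap_fd, VIDIOC_EXPBUF, &exp) < 0) {
		perror("VIDIOC_EXPBUF");
		return 1;
	}

	/* 3. Create a DMABUF-type output queue on the encoder: no memory is
	 * allocated here, the driver just waits for file descriptors. */
	struct v4l2_requestbuffers enc_req = {
		.count  = 1,
		.type   = V4L2_BUF_TYPE_VIDEO_OUTPUT,
		.memory = V4L2_MEMORY_DMABUF,
	};
	if (ioctl(enc_fd, VIDIOC_REQBUFS, &enc_req) < 0) {
		perror("VIDIOC_REQBUFS (encoder output)");
		return 1;
	}

	/* 4. Queue the exported fd: the encoder reads straight from the memory
	 * the capture device writes into, without any CPU copy. */
	struct v4l2_buffer buf = {
		.type   = V4L2_BUF_TYPE_VIDEO_OUTPUT,
		.memory = V4L2_MEMORY_DMABUF,
		.index  = 0,
	};
	buf.m.fd = exp.fd;
	/* A real application would also set buf.bytesused to the payload size,
	 * negotiate formats first, and call VIDIOC_STREAMON on both queues. */
	if (ioctl(enc_fd, VIDIOC_QBUF, &buf) < 0) {
		perror("VIDIOC_QBUF (dmabuf import)");
		return 1;
	}

	printf("queued dmabuf fd %d on the encoder\n", exp.fd);
	close(cap_fd);
	close(enc_fd);
	return 0;
}
```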
And what you usually use if you talk to graphics devices is some kind of API; for the OpenGL generation of things, the interchange format is EGL images, so the relevant extensions are listed here. Those are buffer import and export extensions to EGL that create EGL images out of buffers, or turn them into file descriptors to be passed around.

I have two little examples, for the Video4Linux case and for the EGL case, because those are what we usually use when we write an application. Don't be alarmed. This is the export case on the left side and the import case on the right side. What you would do, for example on the capture device, is open the capture device; in this case I have opened the CSI capture device by name, via a symlink that udev automatically creates for the video device given its name, and you will see later why I wrote it like this and not /dev/video0 or something like that. After opening this device, the second thing that happens is an ioctl call using VIDIOC_REQBUFS. This is a request for buffers, and the structure passed to this ioctl specifies how many buffers we want and what type of queue should be set up. In this case it is a capture queue for the capture device, and we want memory-mapped buffers, so the driver actually allocates memory for us and we could map the buffers into user space, but in this case we won't do that. We call the export buffer ioctl, VIDIOC_EXPBUF, give it the buffer index, the only one we created, and we get back a file descriptor in the same structure, and this can be passed on to the next device.

On the right-hand side, there is an example for the output queue, which is the input of a memory-to-memory device, so in this case it would be, for example, the encoder. It just gets asked to create a queue of buffers of the DMABUF type, which means the driver does not actually create any buffers; it just waits for file descriptors to be queued with the queue buffer command, and that is the example call at the end. Usually the left example is self-contained: if I just add headers, it will compile and work. The right example obviously needs a DMA-BUF from somewhere, so this is not all that is needed, and if you want to stream video, you obviously have to do format settings first, unless you are happy with the default format, and then start streaming and begin queueing and dequeuing buffers.

In the case of OpenGL ES, there are those two extensions I described earlier, which create an EGL image; the EGL image is the transfer type for those external buffers to be passed around between the APIs. In this case there is a structure you have to fill: you give it width, height, format and the stride, plus the DMA-BUF file descriptor, and from this an EGL image handle is created. This can then be used in any Khronos API that sits on top of EGL, for example OpenGL ES, OpenCL, or even OpenVG, to create something that API can use. In this case, we use another OpenGL ES extension to create an actual texture, in this case an external texture, which means that, if possible, when this texture is bound and sampled, the driver should sample directly from the imported buffer.
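As a rough illustration of that import path, here is a minimal sketch. It assumes an already-initialized EGL display and OpenGL ES 2 context and a single-plane XRGB8888 buffer; the helper name and the format choice are just for illustration, and extension-presence checks are skipped.

```c
/* Sketch: wrap a DMA-BUF fd in an EGLImage and bind it as an external
 * texture (EGL_EXT_image_dma_buf_import + GL_OES_EGL_image_external).
 * Assumes EGL and a GLES2 context are already set up. */
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>
#include <drm_fourcc.h>   /* from libdrm, for DRM_FORMAT_XRGB8888 */

GLuint import_dmabuf_texture(EGLDisplay dpy, int dmabuf_fd,
			     int width, int height, int stride)
{
	/* The extension entry points have to be looked up at runtime. */
	PFNEGLCREATEIMAGEKHRPROC create_image =
		(PFNEGLCREATEIMAGEKHRPROC)eglGetProcAddress("eglCreateImageKHR");
	PFNGLEGLIMAGETARGETTEXTURE2DOESPROC image_target_texture =
		(PFNGLEGLIMAGETARGETTEXTURE2DOESPROC)
		eglGetProcAddress("glEGLImageTargetTexture2DOES");

	/* Describe the buffer: one XRGB8888 plane at offset 0. */
	EGLint attribs[] = {
		EGL_WIDTH, width,
		EGL_HEIGHT, height,
		EGL_LINUX_DRM_FOURCC_EXT, DRM_FORMAT_XRGB8888,
		EGL_DMA_BUF_PLANE0_FD_EXT, dmabuf_fd,
		EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
		EGL_DMA_BUF_PLANE0_PITCH_EXT, stride,
		EGL_NONE
	};
	EGLImageKHR image = create_image(dpy, EGL_NO_CONTEXT,
					 EGL_LINUX_DMA_BUF_EXT, NULL, attribs);
	if (image == EGL_NO_IMAGE_KHR)
		return 0;

	/* Bind the EGLImage to an external texture; ideally the GPU then
	 * samples straight from the imported buffer without a copy. */
	GLuint tex;
	glGenTextures(1, &tex);
	glBindTexture(GL_TEXTURE_EXTERNAL_OES, tex);
	image_target_texture(GL_TEXTURE_EXTERNAL_OES, (GLeglImageOES)image);
	return tex;
}
```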
Directly sampling from such an imported buffer is currently a limitation of the etnaviv driver, it doesn't work yet, but maybe we will talk about this at the very end of this talk. And for export, there is another extension which can just take one of those EGL images, which could have been created from a texture or from a renderbuffer, and we get back a file descriptor which we can pass somewhere else.

So, easy, right? That is quite a lot of code we just saw, and that is definitely not the easy way, I would say. So what is actually the easy way? Enter GStreamer. This is taken from the GStreamer website: GStreamer is a library for constructing graphs of media-handling components. It is best if I show an example; don't be scared, it is actually easy. All I did was execute the command line at the top, gst-launch with playbin and some file, and GStreamer did all the stuff down here. What we see here is some decoding and some displaying, but all of it is hidden below the surface and done automatically by GStreamer. So using GStreamer is actually pretty easy, and all the heavy lifting is done by GStreamer.

Some more information on GStreamer. GStreamer calls its outputs sinks, and it supports many different outputs, like Wayland, streaming via WebRTC, or putting the video into your QML interface. You have various plugins you can use, for Video4Linux, as we just saw (we saw the pipeline and the APIs that GStreamer is using), or for OpenGL, which GStreamer also uses, and you can add third-party plugins for your special use cases. There are language bindings for common languages, so you can just use GStreamer from your program in that language. And as I said, GStreamer does all the heavy lifting; you have auto-plugging for that, so you can decode stuff, encode stuff, and just play it on whatever interface GStreamer thinks is best suited right now.

I will take a quick look into Video4Linux. There are various elements that use the Video4Linux devices, and these elements support DMABUF import and export, so you get zero copy if you use them and your driver supports it. If you want to know more about this, check out Nicolas' talk from the GStreamer conference about exactly how zero copy works with the Video4Linux elements in GStreamer. The second element I want to look at is kmssink, which just uses DRM display devices. You add it to your pipeline and it uses whatever display it finds. I found this quite easy to test, because it doesn't do more than taking over your screen and displaying the video. If you want more, like compositing, you can use the Wayland sink. Wayland is a display server protocol, most of you will probably know it, and there is a protocol extension for DMABUF, so the compositor takes the DMABUF and decides what it will do with this memory, use it for OpenGL or display it directly. If you use this element, note that the protocol is quite new and still unstable, so your mileage may vary, but try it out, it has worked pretty well for me so far.

So this is our example: we capture, we encode, and we stream it somewhere else, and below this schematic I have just put the command line that we can use for that. In this case it is a v4l2src element, which is given a device name, again by name. If I take this device parameter out, it will just try to open /dev/video0.
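The command line itself is on the slide rather than in this transcript, so here is a rough reconstruction of the sending pipeline as described, wrapped in the few lines of C needed to run it from an application. The element names, properties, device path, and host/port are taken from the description in this and the following paragraphs and should be treated as assumptions.

```c
/* Sketch of the sending side: capture, encode to H.264, stream as RTP/UDP.
 * The pipeline string is reconstructed from the talk; device name, host and
 * port are placeholders. Depending on the camera, a caps filter or converter
 * may be needed between the source and the encoder.
 * Build with: gcc send.c $(pkg-config --cflags --libs gstreamer-1.0) */
#include <gst/gst.h>

int main(int argc, char *argv[])
{
	gst_init(&argc, &argv);

	GError *error = NULL;
	GstElement *pipeline = gst_parse_launch(
		"v4l2src device=/dev/video-csi io-mode=dmabuf ! "
		"v4l2h264enc output-io-mode=dmabuf-import ! "
		"rtph264pay ! udpsink host=127.0.0.1 port=5004",
		&error);
	if (!pipeline) {
		g_printerr("failed to create pipeline: %s\n", error->message);
		return 1;
	}

	gst_element_set_state(pipeline, GST_STATE_PLAYING);

	/* Run until an error or end-of-stream is posted on the bus. */
	GstBus *bus = gst_element_get_bus(pipeline);
	GstMessage *msg = gst_bus_timed_pop_filtered(bus, GST_CLOCK_TIME_NONE,
			GST_MESSAGE_ERROR | GST_MESSAGE_EOS);
	if (msg)
		gst_message_unref(msg);
	gst_object_unref(bus);
	gst_element_set_state(pipeline, GST_STATE_NULL);
	gst_object_unref(pipeline);
	return 0;
}
```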
And this first parameter, io-mode=dmabuf, instructs the element to actually call the VIDIOC_EXPBUF ioctl and pass the buffer on, with the GstBuffer object, to the next element. So from the point of view of the Video4Linux element on the sending side, it doesn't make much of a difference whether we are using DMABUF or not; it is just an additional ioctl that is called to get the file descriptor reference. One caveat: the io-mode parameter is necessary up to GStreamer 1.12. I hope 1.14 will be released soonish, I think the release process has not started yet, but the next version will do this automatically, so this parameter can simply be dropped in the future.

The second element is the Video4Linux video encoder element, specialized for the H.264 codec. The way this works is that the Video4Linux video encoder element goes over all the video devices it finds that support encoding, and for each supported encoder format it generates one of those GStreamer elements with the correct name. The first one will not contain the device name, so its name is always fixed, and usually on those devices you have just one video encoder or decoder device for a specific format, so the default case is enough and you can just use this element as is. The output-io-mode=dmabuf-import parameter is the counterpart of the parameter given to v4l2src: it instructs the encoder element not to create mmap buffers on the input side of the Video4Linux encoder, but to create this DMABUF queue and to pass the file descriptors it receives from the video source into this queue. So the encoder directly reads from the memory buffers that the capture device wrote into. This parameter will also not be necessary in the future, but support for negotiating this automatically is not yet implemented in mainline; it is something that will happen.

After this, there is an rtph264pay element that just splits the encoded video frames apart, adds RTP headers, and passes this on to udpsink, which pushes those buffers into the network stack and streams them away. Without any parameters given to udpsink, it just streams to localhost, port 5004, I think.

And on the other side, it is as easy as doing the opposite to receive and unpack the buffers. There is a udpsrc; in this case we have to give it the RTP caps with the payload parameter. This payload 96 is, I think, the first dynamic payload number in the RTP standard RFC, and it is used here for the H.264 stream, so the rtph264depay element will accept the stream and not drop it. After this depayloader element, which takes the UDP packets, removes the headers, and reassembles the H.264 frames, there is the h264parse element, a stream parser that just looks at the stream for correctness. It is not strictly necessary, but it drops all frames at the beginning if we start receiving while the stream is already running: P-frames are frames that reference older data, so the first few frames would be garbage, and h264parse only lets frames through after the first I-frame, which is the first frame that carries enough standalone data to be displayed correctly on its own. Those buffers are then passed straight through to the Video4Linux H.264 decoder element, which again is a decoder specialized for the codec, automatically created by GStreamer from the Video4Linux video decoder element.
And again, we have the io-mode=dmabuf parameter, which will not be necessary anymore in GStreamer master or the next release. The decoded buffers on the other side get passed into waylandsink in this example, which just opens a window when your Wayland compositor is running, for example Weston, and displays them; the compositor composites them with the GPU over anything else that is displayed, just like on your desktop. And after the frames are composited together, the Wayland compositor will flip them onto the display interface.

So this is easy to use, but apart from those two parameters, which will vanish in the future, there is one more wart to be removed, and that is the camera input pipeline. We have a pretty complicated setup on the i.MX6, in that the capture pipeline is not just a capture element that is connected to the camera and dumps the stream directly to memory. There are quite a few elements; the ones in grey are the ones we don't use in the simple case, which can scale, even scale to two different outputs at the same time, and then there is a second instance of all this, so up to three cameras can be connected and the paths can be muxed around. We end up with a lot of /dev/video devices, those are the yellow boxes at the bottom, but we just need the top left part of this. At the moment, this is not configured automatically and, OK, we have to stop in a bit, so I will just go quickly through the next slides. We just have to run those configuration commands, and they are per board at the moment. In the future, we want to put this somewhere in the device tree and maybe auto-negotiate those formats. Pavel Machek described the same problem in his talk, I think the day before yesterday.

A quick look at future work. We have to get a useful default media controller configuration, so that we don't have to configure the camera pipeline by hand anymore. There are some features missing in Mesa and etnaviv, like NV12 and linear texture import; as Philipp said before, the driver cannot yet directly sample from a linear buffer, so it has to copy it into a tiled format first. We would also like to have OpenCL support, because the lens correction, no, the Bayer conversion, is much easier with OpenCL. And Weston gets much more powerful if you are using the atomic mode setting patch set, which has not yet landed in master, but it supports the display planes and improves your output path.

There are some open questions. There should be a way to auto-configure the camera pipeline, as we saw before, so not just a useful default configuration, but automatically configuring the pipeline depending on your use case. There is one remaining proprietary blob in the system, which is the VPU firmware. And in order to use the Video4Linux devices, we currently have to run the pipeline as root, but be aware that your streaming input may be untrusted, so you probably don't want to do that; the PipeWire project might help to get rid of this.

So the conclusion is: we have various co-processors on modern embedded systems, and they can do zero copy on Linux using the DMA-BUF abstraction, which is pretty useful. But you should use GStreamer to handle all the ugly details, so that your application is just a GStreamer pipeline and it is doing zero copy. Still, you have to know your hardware, because there might be some limitations on which co-processors can talk to each other.
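To illustrate that last point, here is the corresponding receiving side as a small C application rather than a gst-launch line, again reconstructed from the talk; the element names, caps, and the capture-io-mode property are assumptions, and the dot-file dump is an extra so you can check which elements and I/O modes were actually negotiated.

```c
/* Sketch of the receiving side: RTP/UDP in, hardware H.264 decode, Wayland out.
 * Element names, caps, and the io-mode property are reconstructed from the
 * talk and may need adjusting for your GStreamer version and hardware. */
#include <gst/gst.h>

int main(int argc, char *argv[])
{
	gst_init(&argc, &argv);

	GError *error = NULL;
	GstElement *pipeline = gst_parse_launch(
		"udpsrc port=5004 caps=\"application/x-rtp,media=video,"
		"clock-rate=90000,encoding-name=H264,payload=96\" ! "
		"rtph264depay ! h264parse ! "
		"v4l2h264dec capture-io-mode=dmabuf ! waylandsink",
		&error);
	if (!pipeline) {
		g_printerr("failed to create pipeline: %s\n", error->message);
		return 1;
	}

	gst_element_set_state(pipeline, GST_STATE_PLAYING);
	/* Wait for the state change to complete before inspecting the graph. */
	gst_element_get_state(pipeline, NULL, NULL, GST_CLOCK_TIME_NONE);

	/* With GST_DEBUG_DUMP_DOT_DIR set, this writes out the negotiated
	 * pipeline graph so you can verify that the hardware decoder and the
	 * dmabuf path were actually picked and nothing fell back to software. */
	GST_DEBUG_BIN_TO_DOT_FILE(GST_BIN(pipeline),
			GST_DEBUG_GRAPH_SHOW_ALL, "receive-pipeline");

	GstBus *bus = gst_element_get_bus(pipeline);
	GstMessage *msg = gst_bus_timed_pop_filtered(bus, GST_CLOCK_TIME_NONE,
			GST_MESSAGE_ERROR | GST_MESSAGE_EOS);
	if (msg)
		gst_message_unref(msg);
	gst_object_unref(bus);
	gst_element_set_state(pipeline, GST_STATE_NULL);
	gst_object_unref(pipeline);
	return 0;
}
```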
You have to be aware of corner cases where the drivers do not yet support something, like the NV12 case. And you should always check your GStreamer pipeline: if you use gst-launch, it might not actually be doing what you expect, because if some driver is not initialized correctly, it might just fall back to software encoding or decoding, which is slow, of course. And if you are using driver blobs, they may use zero copy or they may not; you don't know, and it gets hard. So if you want to have zero copy, you should definitely avoid blobs.

So that was our talk. Thank you, and if there are any questions, feel free to ask. We don't have that much time, but you can still talk to us somewhere. Thank you.