So thank you very much for coming to my presentation today. My name is Katsuya Matsubara. I work on Linux kernel porting and base system integration for embedded systems. Recently, I have been in charge of multimedia integration for some Renesas platforms. Today, I'd like to talk about GStreamer system integration for embedded platforms, especially optimization techniques for video plugins.

First of all, a small introduction: what is GStreamer, and what does it provide? GStreamer is a cross-platform multimedia framework. For Linux systems in particular, GStreamer can be the de facto standard media framework for embedded: Tizen has adopted GStreamer as its media server, and there is an Android integration of GStreamer which can replace Stagefright.

With GStreamer, various types of media processing can be realized by describing a data flow, called a pipeline, built from media-handling components named plugins. There already exist over 200 plugins you can use, and they are maintained in the "base", "good", "bad", and "ugly" sets according to the status of their implementation quality. Third-party plugins such as gst-openmax and gst-ffmpeg are also available.

Plugins can also be classified as source, filter, and sink elements according to their roles. A source element generates data for output; examples are filesrc, which reads a file, videotestsrc, which creates a test video stream, and v4l2src, which reads frames from a Video4Linux device. A filter, or filter-like, element both receives and provides data: the ffmpegcolorspace plugin converts video from one color space to another, videocrop crops a video image to a sub-region, and qtdemux works as a demuxer for MP4, MOV, and 3GP container files. Of course, a decoder is a filter element too. A sink element accepts data: fbdevsink, ximagesink, and dfbvideosink can be used to render video to the screen via the framebuffer, the X window system, and DirectFB respectively, and if we want to store data in a file, filesink is capable of that.

The gst-launch command provided by the GStreamer core package is useful for describing a media processing pipeline easily: you just declare the pipeline as a command-line argument describing how plugins are connected and how data flows from the source element to the sink element (see the small sketch at the end of this introduction). In fact, I've used this command to construct the typical video applications I'll explain later.

Before moving on to the main talk, I should mention the version issue. Currently, the community releases two series of GStreamer versions. Version 0.10 is popular and widely used at the present moment; the latest 0.10 release is 0.10.36. Version 1.0 has been the current stable version since last September, but its ABI and API are not compatible with 0.10, and some plugins have not been migrated yet. My work is based on the latest 0.10 GStreamer, because I started this work before version 1.0 was released.

Now let's look at the goals of my work. The first goal is implementing two typical video applications with existing GStreamer plugins: video monitoring, which continuously captures video input from a camera and shows it on the LCD screen, and video playback, which decodes a compressed video stream and displays it. The second goal is optimizing the performance of these video applications: hardware accelerators in the SoC should be utilized for media processing, and CPU-intensive overheads, such as memory copies, should be eliminated for optimal performance on an embedded system.
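Before introducing the target platform, here is a minimal sketch of the gst-launch pipeline style I just mentioned, built only from stock 0.10 elements named above; ximagesink assumes an X display is available, so substitute another sink for your setup.

    # A source ! filter ! sink pipeline: videotestsrc generates a test
    # stream, ffmpegcolorspace converts the pixel format if needed, and
    # ximagesink renders it into an X window.
    gst-launch videotestsrc ! ffmpegcolorspace ! ximagesink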
Next, let me introduce my target platform. The Renesas R-Car E1 allows an entry-level automotive system to be realized on a single SoC. The SoC has a single Cortex-A9 core, controllers for various peripherals, and hardware accelerators for media and graphics. I'd like to focus on the video processing units here: the VPU, the video processing unit, can work as decoder hardware with a small firmware, and the video-in device, VIN for short, can be used to capture video from a camera through an analog video input.

OK then, so much for the introduction. Now let me move on to the first application, video monitoring. This slide shows the pipeline for video monitoring: the command line is a gst-launch style description, and the figure shows how video data flows from the source input to the output. The v4l2src element reads the video frames captured by the VIN device. The VIN outputs video as 16-bit RGB images, so each image has to be converted to the destination LCD color format, which on my target board is 32-bit ARGB. ffmpegcolorspace converts the video color format, and then the dfbvideosink element copies the video into the screen framebuffer with a stretch blit operation.

This movie shows the LCD screen on the target board: the camera is monitoring aquarium fish, and the video monitoring application is running with GStreamer on the board. It works without any extra integration, but the performance is quite bad. Too slow.

Why such poor performance? I found the following factors degrading the performance. First, in the current pipeline, v4l2src invokes a CPU memory copy on each video frame, copying from the V4L2 buffer into a GStreamer buffer. Additionally, the VIN device operates in its low-speed mode. The VIN has two operation modes, single capturing mode and continuous capturing mode, and the kernel driver decides which mode is adopted according to the number of buffers prepared; that means v4l2src does not prepare enough buffers for the VIN. Also, ffmpegcolorspace is implemented in software, which must be an overhead on a low-spec embedded platform. Furthermore, the stretch blit operation in dfbvideosink is realized by DirectFB's software renderer.

So let me look at the optimization points for performance improvement. In v4l2src, the CPU memory copy should be suppressed, and continuous capturing mode should be activated by supplying enough buffers. Then, the acceleration hardware in the SoC must be utilized for color conversion and stretch blitting; if dfbvideosink can realize color conversion in hardware, ffmpegcolorspace is no longer necessary.

The first two optimization points for v4l2src can be realized with element properties. A GStreamer element can have properties, which are used to report its configuration or to configure the element, and the gst-inspect command shows what properties an element has. Quoting from the gst-inspect output for v4l2src: the queue-size property configures the number of buffers prepared for V4L2, with a default value of 2, and the always-copy property controls copying from the V4L2 buffer into the GStreamer buffer. So we can set the queue-size property to a sufficient number of buffers for continuous capturing mode, and disable the copying by setting the always-copy property to false.
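As a hedged sketch of that tuning: the property names below are the ones gst-inspect reports for the 0.10 v4l2src, but the concrete queue-size needed for continuous capturing depends on the VIN driver, so 4 here is only an assumed value.

    # Show the element's properties, including queue-size and always-copy:
    gst-inspect v4l2src

    # Monitoring pipeline with enough buffers queued for continuous
    # capturing mode and the per-frame V4L2-to-GStreamer copy disabled:
    gst-launch v4l2src queue-size=4 always-copy=false \
        ! ffmpegcolorspace ! dfbvideosink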
Let's move on to the next optimization: utilizing hardware acceleration in dfbvideosink. In order to use hardware from a user space plugin, there are some requirements we need to meet. The first requirement is hardware access from user space: the user space plugin has to read and write the hardware's registers, and it also has to handle interrupts from the hardware. The second is DMA memory allocation in user space: the hardware accesses memory through DMA, so physically contiguous memory may be necessary for the buffers. The last requirement is the address issue: hardware can only handle physical memory addresses, but a user space program usually handles virtual addresses, so if a user space plugin manages hardware directly, virtual-to-physical address translation in user space becomes necessary.

To find a solution, let's focus on the buffer management. The current pipeline uses three types of buffers. The V4L2 buffers are allocated by the V4L2 kernel driver; a hardware device can access them because they are usually placed in DMA-able memory, but the driver exposes only the corresponding virtual addresses to user space. The GStreamer buffers are typically allocated by malloc in user space, so hardware cannot access them directly, and the user space program never knows their physical addresses. The framebuffer can naturally be accessed by hardware, and its physical address can be obtained from user space through an ioctl on the framebuffer device.

Now we have a solution for the requirements I mentioned. Userspace I/O, UIO, is a framework in the mainline kernel for controlling hardware from user space. UIO provides mmap functionality for I/O memory and for DMA memory allocated in the kernel, and a read call on the UIO device can notify the user space driver of a hardware interrupt. Furthermore, UIO exports the physical addresses of the I/O memory and DMA memory through sysfs entries (a small shell sketch follows below).

I'd also like to introduce an open source library to manage UIO resources: UIOMux implements resource management of the DMA memory and the virtual/physical address translation for the I/O memory and DMA memory that UIO provides.

An additional memory copy elimination can be realized with V4L2's USERPTR mode, which lets us use buffers prepared in user space instead of ones allocated by the kernel driver. With USERPTR, DMA memory regions allocated via UIO can be assigned to the V4L2 buffers, and those buffers can then be read directly by the rendering hardware using the corresponding physical addresses exported by UIO.

Now let's check the buffer features again. With the buffer organization I suggested, the picture becomes ideal: the V4L2 buffers are assigned to UIO-managed memory whose physical addresses are visible through UIO, and separate GStreamer buffers are no longer used at all because the memory copy has been eliminated. So now we can prepare memory that is appropriate for hardware usage.

Then I'd like to utilize the video image processing hardware in the Renesas SoC. The VIO hardware can perform stretching, blitting, and color conversion in one pass. I've implemented a user space library to control it: libshvio works as a UIO user space driver and provides hardware-accelerated image processing functionality in user space.

So this slide shows the optimized pipeline for video monitoring. The UIO kernel driver allocates the DMA memory and exports it through UIOMux, UIOMux manages the virtual/physical address translation for the DMA memory and I/O memory, libshvio controls the VIO hardware directly through UIO, and dfbvideosink realizes hardware-accelerated stretch blitting and color conversion with libshvio.
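As a quick illustration of the UIO side, here is what the kernel exposes, assuming a device registered as uio0; the device name and the map layout depend on your kernel driver.

    # Name of the UIO device registered by the kernel driver:
    cat /sys/class/uio/uio0/name

    # Physical address and size of the first mappable region
    # (I/O registers or DMA memory, depending on the driver):
    cat /sys/class/uio/uio0/maps/map0/addr
    cat /sys/class/uio/uio0/maps/map0/size

    # A blocking read on /dev/uio0 returns when an interrupt arrives;
    # the user space driver mmap()s the regions through this device.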
Let's check the revised video monitoring on the target board. The performance is now quite sufficient, and the CPU load is low thanks to the memory copy elimination.

Then I'd like to move on to another application, video playback. This slide illustrates the example pipeline for H.264 compressed video playback. filesrc reads the media content file from disk, and qtdemux extracts the video and audio streams. The FFmpeg decoder then decodes the video stream, ffmpegcolorspace converts the video color format to fit the destination LCD color space, and finally dfbvideosink displays the video on screen. With a small-sized video this pipeline works well on the target board; however, with HD video the screen is almost frozen, and after a while log messages appear in the console: GStreamer reports that a lot of buffers are being dropped and that this computer is too slow.

So we should use a hardware decoder on an embedded platform. Fortunately, Renesas has released an OpenMAX IL video decoder for the target R-Car E1 platform. OpenMAX IL is the major standard interface for media components, often used as a standard codec engine API; for example, Android adopts OpenMAX IL as its codec interface. The OpenMAX IL specification is simple and flexible: almost all operations are abstracted as simple buffer operations. And OpenMAX IL component binaries for your target may well be distributed by the chip vendor or board supplier.

So I considered integrating the OpenMAX component into GStreamer. The gst-openmax plugin can wrap and control an OpenMAX IL component: it can act as a filter element, like a decoder or encoder, and it can also be used as a source or sink element. gst-openmax implements the OpenMAX IL 1.1.1 client specification. You might hope this makes it easy to integrate your OpenMAX component into GStreamer, but unfortunately, most OpenMAX components require adaptation of the OpenMAX IL client, because the OpenMAX spec is not strictly defined; you will face vendor-specific requirements.

One typical vendor-local behavior of an OpenMAX component is the granularity of data input. In fact, the Renesas OMX component for H.264 video expects one NAL unit per input buffer, whereas the qtdemux plugin outputs one frame per buffer. So we have to split each buffer, storing the one frame of data into several buffers, one per NAL unit. In addition, qtdemux picks the SPS and PPS unit data out of the buffers and passes them through the caps. Caps is the mechanism in GStreamer for exchanging metadata about buffer data, and it can be attached to inputs and outputs. gst-openmax, on the other hand, expects to receive the SPS and PPS data in the buffers themselves.

Another issue is who allocates the buffers: the OpenMAX specification allows both the client and the OpenMAX component itself to allocate buffers, but a vendor component might only support self-allocated buffers. Additionally, an OpenMAX component may initially require vendor-specific parameter setup. On top of that, we may have to customize the gst-openmax code to deal with vendor-specific behavior that is out of the scope of the spec; for example, our OpenMAX component needs an explicit buffer flush whenever a seek command is issued. Finally, additional metadata about the output may have to be passed downstream: typical metadata for video output are the row stride of the decoded image, tiled addressing, and the position of the image of interest within the buffer.
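To put the two variants side by side, here are hedged gst-launch sketches of the software pipeline just described and of the hardware-decode variant described next. These are sketches, not exact command lines from my slides: video.mp4 is a placeholder path, omx_h264dec is the gst-openmax element name I assume here, split-packetized on legacyh264parse is my assumption for getting one NAL unit per buffer, and on this board the second line also needs my customized gst-openmax and dfbvideosink.

    # Software decoding: fine for small clips, too slow for HD.
    gst-launch filesrc location=video.mp4 ! qtdemux \
        ! ffdec_h264 ! ffmpegcolorspace ! dfbvideosink

    # Hardware decoding with the vendor OpenMAX IL component:
    # the parser splits frames into NAL units for the decoder, and
    # the decoder's NV12 output goes straight to the VIO-backed sink.
    gst-launch filesrc location=video.mp4 ! qtdemux \
        ! legacyh264parse split-packetized=true \
        ! omx_h264dec ! dfbvideosink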
So this is the optimized pipeline, fitted to the Renesas OpenMAX requirements. The legacyh264parse plugin splits the qtdemux output into NAL units and also extracts the SPS and PPS units from the caps and pushes them back into the buffers. The gst-openmax H.264 decoder has been customized and adjusted for the Renesas OMX component. The Renesas OMX decoder outputs in the NV12 color format, which the VIO hardware can handle, so ffmpegcolorspace is gone.

Let's watch the HD video playback with the optimized pipeline. There is no frame drop, and the CPU load is around 30 percent, which is not heavy.

Finally, let me summarize the points of my optimization for these video applications. I explained how to utilize hardware accelerators via user space plugins: UIO allows a user space driver to directly handle hardware resources, and I've implemented a middleware library for the image processing hardware which works as a user space driver on UIO and realizes hardware-accelerated color conversion and stretch blitting in user space. gst-openmax can integrate a vendor OpenMAX IL decoder component into GStreamer, but some adjustments in gst-openmax are necessary because of vendor-specific requirements. Furthermore, I described how buffer memory management should be organized for hardware usage: physically contiguous memory and virtual/physical address translation in user space are essential. And finally, I would like to emphasize that it is quite important to eliminate memory copy operations, especially for video applications on embedded platforms.

As for future work, I should migrate to GStreamer version 1.0, which has been redesigned with hardware optimization in mind; in particular, its buffer management offers more flexibility when handling special memory, such as physically contiguous memory or chained chunks of memory. And of course, the migration will be necessary to submit my work to the community.

Anyway, I've uploaded my GStreamer code and the related libraries to GitHub, so you can look at the code if you are interested in the details of the optimization. Thank you for your kind attention.