Okay, welcome. I think it's time to start. My name is Ruud Derwig, working for Synopsys. I'm going to talk about using GStreamer to offload audio processing to a DSP that's more efficient at processing audio than the application processor.

A quick outline: I'll start by introducing the buzzwords in my title, then talk about the problem we want to solve, together with a very high-level overview of the solution. If you want to know the details, please bear with me during the introduction. After that I'll go into some of the details of how we accomplished the GStreamer plug-in. If there's time, I could do a quick demo. We also have a demo set up at the Synopsys booth outside, so that's probably a better place for a demo. Then we round off with some Q&A.

First, a short introduction to Synopsys. Synopsys is not that well known among software engineers, but it's a big EDA company, Electronic Design Automation. It makes design tools for creating chips. Besides these tools, there's also a large portfolio of IP: the hardware building blocks that you can use for creating a large SoC. We have a pretty big portfolio; all together it already looks like a complete SoC, with all kinds of connectivity solutions. The parts that are more relevant for this talk are the processors that we have, the ARC family of processors. At the top left here you see an ARC 700. That's a version of the ARC core that actually runs Linux. It's a pretty exciting moment, since we are in the phase of mainlining and sending out the pull request for the 3.9 merge window to get ARC Linux mainlined. I hope that will all go fine; I'm pretty sure it will. Then there's another version of the ARC processor, the audio processor, and that's the one that's more relevant now. I'll tell a little bit more about it during my talk.

So audio — what's the trend in audio? I think you're all aware that video resolutions keep increasing, but the same is actually happening for audio. With every device being internet enabled, all devices should be able to play all different types of content. Next to that you get other sources of complexity: where stereo was the norm some time ago, it went to five channels, six channels. We're now at 7.1, and it's going up to 9.1, even 11.1, in our homes. Furthermore, driven partly by the Blu-ray disc standards, you get higher audio quality, so 24-bit precision going even up to 32-bit, and higher sample rates, up to 192 kilohertz. And finally, there's all kinds of post-processing and pre-processing in audio, all kinds of clever up-mixing and down-mixing to support the multi-channel output. Also in the voice domain, in smartphones: all smartphones now already have two microphones, that will go up to microphone arrays, and you will get adaptive beamforming, so that when you talk to your phone, it knows where you are and picks up the best signal. So more and more audio, much more processing, and the processing demands go up to hundreds of megahertz. It sounds like a good idea to have a subsystem, a solution for audio, that takes care of all these requirements.

I have a couple of slides showing some of the use cases and the way we think of audio. Typically, we'd like to see audio, like all multimedia, as a graph of processing steps, and we want to be flexible in creating these graphs: instantiating a decoder, a post-processor, hooking them up, and by that creating more complex systems.
The example here is relatively simple: just playback from a file. You get the data in, you do the decoding — a stereo signal here — post-processing, maybe some equalization and volume settings, and then render it to a speaker or headphone, or maybe, in a home situation, send it through a digital output, S/PDIF here, but it could also be HDMI. A bit more complex: we add an input, so a microphone with an analog-to-digital converter, and then you can do encoding and store it to the storage medium, play it back, and send it again to speakers or a digital output.

One more step and it starts to become a bit more interesting: a DTV or a set-top box where you get a transport stream in. You demultiplex it and send the audio to a decoder, typically multi-channel; you could do some downmixing if you have headphones connected to your system. In many cases, if you want to hook up to another audio system, or from your set-top box to a TV, you send the audio over HDMI, and typically you do that compressed as well, so you need an encoder to take the decoded signal. And in parallel to watching that one channel, you might also want to record another, so there's a second chain again with decoding, maybe downmixing, encoding and storing it on the hard disk.

This is an example for Blu-ray. Blu-ray is quite interesting for several reasons. One is that it supports multiple audio streams, so you have the primary stream and a secondary decode. That could be a commentary from the director of the movie, for instance. And although Blu-ray means optical discs — who is still using optical discs? maybe the discs go away — the content really stays, and also in new broadcast standards you see that secondary audio channel popping up. Now these two inputs, and maybe a third one with interactively generated, Java-generated sounds, need to be sample-rate converted and mixed. There can again be all kinds of post-processing — there's mandatory post-processing in the Blu-ray standard — then encoding and sending to different outputs.

These examples are from more of a home, DTV type of area, but also for mobile use cases, voice communication, speech, you see that the requirements keep growing. For voice, typically you have the microphone array with beamforming, echo cancellation and noise suppression, then you do the encoding towards the uplink, and for the downlink, also noise suppression and some equalization to get the best out of the speaker in your handset.

So as I said earlier, this all leads to hundreds of megahertz that you can spend on audio processing. Hundreds of megahertz typically is not really a problem for a current application processor, where you have a dual or quad core running at gigahertz speeds. Still, we think it's not a very good idea to do all that processing on the application processor, and the main reason is that doing it on a DSP is far more efficient. From a power perspective: for the ARC core that I mentioned, for instance, we did some comparisons with a standard ARM core with NEON. It's 10 times more power efficient. But also from an area perspective you see that — not even looking at the total core, just the NEON coprocessor — an ARC core is half the size. So basically you can replace one of the NEONs with two ARC cores, save 10 times the power for your audio processing, and have less area as well. So why is that DSP so efficient?
Well, I don't want to go into the details of the microarchitecture of modern RISC processors. I just want to point out two things, or maybe three. You have the regular, modest five-stage pipeline, but what you can do is add specific instructions like multiply-accumulate, maybe some simple vector functions, that are really tailored for the audio processing domain. So that's one: special instructions. The other one is a so-called XY memory. You could see it as a local memory: it has fast, single-cycle access, and it can be used as just regular memory, but you can also directly use the contents of that memory in the multiply-accumulate and other ALU operations. That's very powerful. It's even more powerful because it has its own DMA, so while you're processing one bank, you can store the results and fill new data into another bank in parallel. And finally, there's programmable address generation. Instead of you having to calculate the next pointer address, the hardware does that for you, and that enables really zero-overhead loops when you have to loop through an array of samples.

Okay, enough about the hardware. The final buzzword in my title: GStreamer. Anyone that doesn't know GStreamer? Okay, I see one hand. GStreamer is a multimedia application framework. It's very powerful and used quite a lot in combination with Linux and some other OSes. There are several parts to it. The core, which is depicted here in the middle, basically provides an API and support libraries for creating graphs like the ones I indicated with the audio use cases. You can instantiate components, you can hook them up, and then data is streamed from one component to the next. All that streaming and buffering is taken care of by GStreamer. Besides that core, there are also many, many available plug-ins for all kinds of audio and video format handling, audio decoding, et cetera. So that also is a great asset. You can extend the set of plug-ins, and that's basically what we did: we added a plug-in that does the audio processing not on the main core but offloads it. I'll tell a bit more about that in a minute.

Just a quick example of how easy it is to create use cases using GStreamer; there are a couple of steps to it, and a code sketch follows below. First, you create a so-called pipeline. Then you create the elements that should go into that pipeline: a file reader here, where you set a property with the name of the file that should be read. Then you create the decoder and set the decoder type — here it's a multi-standard decoder, and you indicate, for instance, that it should do MP3 decoding. And then finally, the last component is the so-called sink, which takes care of rendering the audio; here we say send it to an I2S output. All the elements are added to the pipeline and are linked. Linking basically creates the arrows, the connections between the different components. And finally, by setting the state to playing, the audio will be decoded and rendered. So it's as easy as that. GStreamer also has a nice command-line utility called gst-launch, with which you can, in a Unix filters-and-pipes style, also create the elements and the connections between them.

Okay, so far the introduction. So what do we want to achieve? Well, do the audio processing on a DSP. But programming a DSP typically is more complex than doing it on your main processor.
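For reference, here is roughly what those pipeline-construction steps look like in code. This is a minimal sketch using the GStreamer 0.10-era C API with generic, standard element names (filesrc, the mad MP3 decoder, alsasink) in place of the Synopsys-specific multi-standard decoder and I2S sink, which are not shown in the talk:

    #include <gst/gst.h>

    int main (int argc, char *argv[])
    {
      gst_init (&argc, &argv);

      /* 1. Create the pipeline and the elements that go into it. */
      GstElement *pipeline = gst_pipeline_new ("playback");
      GstElement *src  = gst_element_factory_make ("filesrc",  "src");
      GstElement *dec  = gst_element_factory_make ("mad",      "dec");   /* MP3 decoder */
      GstElement *sink = gst_element_factory_make ("alsasink", "sink");

      /* 2. Configure the elements through their properties. */
      g_object_set (src, "location", "song.mp3", NULL);

      /* 3. Add the elements to the pipeline and link them:
       *    filesrc -> decoder -> sink.                        */
      gst_bin_add_many (GST_BIN (pipeline), src, dec, sink, NULL);
      gst_element_link_many (src, dec, sink, NULL);

      /* 4. Set the state to PLAYING; data now flows and is rendered. */
      gst_element_set_state (pipeline, GST_STATE_PLAYING);

      /* Keep running for a while (a real player would watch the bus
       * for EOS or error messages instead of just sleeping).        */
      g_usleep (30 * G_USEC_PER_SEC);

      gst_element_set_state (pipeline, GST_STATE_NULL);
      gst_object_unref (pipeline);
      return 0;
    }

The gst-launch equivalent of the same chain would be something along the lines of: gst-launch filesrc location=song.mp3 ! mad ! alsasink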
To program the DSP you have to optimize the code for it, using the XY memories and so on, but you also have to take care of inter-processor communication: sending the data over, using shared memory, interrupts, whatever. We actually want to simply use GStreamer, and the programmer on the application processor should be able to create and instantiate the decoders and other components that actually run on the DSP just as if they were regular GStreamer components. So as an application programmer, you don't have to know anything about the DSP. The file reader and the demultiplexing typically are things you still do on the host, because it can be audio and video together and the video has to be handled somewhere else. But from the audio decoder onwards, the data is actually sent to the DSP side and processed there. In many cases, people simply offload a single codec, but you've seen the graphs that I've shown; these are much more complex than just a single codec. So we actually want to offload complete subgraphs, and keep the data and the processing local on the DSP. Here it's just the rendering that we put after the decoder, but there can be more components in between as intermediate steps.

I think I mentioned the goals we want to achieve: transparent offloading to that heterogeneous multi-core architecture — there can be one DSP, but we actually also want to support multiple DSPs for offloading — and do all that in an efficient way. The efficiency gains you get by using a DSP should not be thrown away again by copying data from one core to another, et cetera.

So now I will build up the system piece by piece. We start with the hardware, obviously. On the DSP, we are not running Linux, but we do put a small real-time operating system there and some drivers for the I/O. Then the most important parts, the codecs: most DSPs typically come with a library of already ported audio processing components, supporting DTS, SRS, all the different standards. Obviously, you can also optimize the code yourself, but typically that's something you want your DSP supplier to do for you. Then source and sink: besides the encoding, decoding and post-processing, the source and sink are the representations of the begin and end points of the streams, getting data either from the host or sending it to an I2S or an S/PDIF output. And then we add a streaming framework. I explained GStreamer as a streaming framework; on our DSP, we also put a very lightweight streaming framework that really helps in offloading these subgraphs instead of single codecs.

On the host side, we start with the host processor, running Linux and using GStreamer, GStreamer with all available plugins. Then there should be something to communicate with the DSP; that's the block here called inter-processor communication. I'll go into the different options for that in a minute. Using that inter-processor communication, you have the plugin for GStreamer that realizes the audio processing elements. And finally, in a standard GStreamer way, you can put your application — your player application and user interface — on top.

So these are the elements. There are two of them I want to zoom in on quickly now, and then I'll go into the details of the host plugin: first the media streaming framework on the DSP and then the IPC. So what would be candidates for a streaming framework on the DSP? GStreamer is very portable, so that could be an option.
Since we want a minimal code size on the DSP and want to be very efficient, we are actually not using GStreamer on the DSP, but something smaller. GStreamer also pulls in some required components like GObject, et cetera, and all that makes it a little bit big for our DSP. Another likely candidate is a standard called OpenMAX — OpenMAX IL, to be specific. Well, it could be done. I'm not a very big fan of the OpenMAX standard; it's really a standard by committee, a bit of a compromise, and it's rather complex to implement, or at least to fully implement. There's a mechanism in OpenMAX called tunneling, or deep tunneling, which would be great for offloading more than a single codec, but often that deep tunneling is not supported or not implemented. So if you want to offload just a single codec, OpenMAX is fine; for offloading larger graphs, you really need something else. Typically, your DSP vendor will have a solution for that, a framework. The framework we use is called MSF, and there are a few concepts depicted here that are good to know for the rest of the presentation. With MSF you create graphs by instantiating modules and connecting them. When both run on the DSP, the connection is simple. If you have a core crossing, then typically you need a source component and a FIFO. That FIFO is basically implemented in shared memory, and it has an efficient protocol so that you can do zero-copy transfers: your application can ask for a pointer where it can store its processed data and then send that pointer to the other side.

Inter-processor communication: again, there are many, many different candidates. I think a very good candidate is the recent remoteproc solution — have a look at it if you don't know what it is; in Barcelona we had a nice presentation on remoteproc, just look for that. There are also standards, the multi-core API, MPI; things like OpenCL could be considered as inter-processor communication, remote procedure calls, et cetera. And again, typically DSP vendors provide you with a solution for using that DSP from a remote host. We have something called MCIcom, multi-core infrastructure communication. There are several solutions, and basically the features that we need for the solution I am talking about with GStreamer are things like starting, stopping and resetting the DSP, downloading the firmware — the DSP images — to the DSP, management of the shared memory for those FIFOs, and then message communication for the control commands that you want to send over. What I consider a nice-to-have, and what is part of the MCIcom solution, is a full remote procedure call abstraction. On top of message passing, of course, you can send messages that enumerate the functions and the commands that need to be executed and use that for calling the functions from the host. With a remote procedure call, you can basically just call a C function on your host processor, and then a proxy is called that actually takes care of the communication and of the real call on the DSP.

Now, given all these elements, the integration: this is how we built the system and how it's demonstrated at the Synopsys booth. On the audio DSP, you have the MQX OS, the MSF framework and the remote procedure calls, and you actually instantiate a number of processing components that you chain together in a graph, and then you can do the audio processing.
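Before moving to the host side, here is a rough sketch of that remote-procedure-call idea. The real MCIcom API is not shown in the talk, so every name below is hypothetical: a host-side proxy packs the call into a message, and a small dispatcher on the DSP unpacks it and performs the real call.

    #include <stdint.h>

    /* Hypothetical message layout shared by host and DSP. */
    typedef struct {
        uint32_t func_id;   /* which remote function to call                     */
        uint32_t args[4];   /* marshalled arguments (buffers passed as offsets
                               into shared memory, not raw host addresses)       */
        uint32_t result;    /* filled in by the DSP side                         */
    } rpc_msg_t;

    enum { RPC_MSF_CREATE_MODULE = 1 /* , ... one id per exported function */ };

    /* Hypothetical primitives provided by the IPC layer and the DSP firmware. */
    extern void     ipc_send_and_wait (uint32_t dsp_id, rpc_msg_t *msg);
    extern uint32_t msf_create_module (uint32_t module_type);

    /* Host side: the proxy that the application calls like a normal C function. */
    int msf_create_module_proxy (uint32_t module_type, uint32_t dsp_id)
    {
        rpc_msg_t msg = { .func_id = RPC_MSF_CREATE_MODULE,
                          .args    = { module_type, dsp_id } };
        ipc_send_and_wait (dsp_id, &msg);   /* blocks until the DSP has replied */
        return (int) msg.result;            /* e.g. a handle to the new module  */
    }

    /* DSP side: the dispatcher that receives the message and does the real call. */
    void rpc_dispatch (rpc_msg_t *msg)
    {
        switch (msg->func_id) {
        case RPC_MSF_CREATE_MODULE:
            msg->result = msf_create_module (msg->args[0]);  /* the real MSF call */
            break;
        /* ... one case per exported function ... */
        }
    }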
On the application processor, we run Linux, and we have the proxy towards that MSF framework, so all MSF functions for creating these components and hooking them together are available on the host and are used by the GStreamer element that I will go into in detail now. Any questions so far? Okay, good.

So, the details. I have a few concepts about how the internals of GStreamer work that are necessary for explaining how we do the seamless offloading, and then I'll zoom in on the plugin. GStreamer concepts: this is what is called in GStreamer terms a pipeline. It consists of a number of elements: a file source element, the demultiplexer, the audio decoder and the video decoder, and then the sinks for rendering to the speakers and putting the video on the screen. Inside these elements you see these rectangles; these are called pads, and they make up the connections. There are source pads — the pads used for streaming data out of a component — and sink pads, which are the inputs of the different components. These pads typically also have so-called capabilities, and the capabilities are basically a description of the format that the component can handle. So for a Vorbis decoder, the capability on the input will say "I can take Vorbis-encoded input", and on the output it will say "I can stream decoded raw PCM data".

Besides the components, there is a number of communication and synchronization mechanisms. First of all, buffers. Buffers are basically just a header with a timestamp and a description of the format, and then a pointer to the payload, the data that is communicated between the components. Then there are events. Events are used for both upstream and downstream communication; events typically contain information like end-of-stream — when a song is ending, you get the end-of-stream event that ripples through the chain. There's a notion of queries: the application or the components can ask information about a stream, for instance what the current playing position is, or what the total file duration is. And then there are messages. These can be used to send information towards the application; the notifications on end-of-stream, for instance, go over the message bus, a GLib/GObject-based message bus. In GStreamer, all the elements run in a single process, and the application typically runs in its own thread, so you implement a message loop and you get the bus messages for communication between the two.

A final slide on the concepts, and that's about threading. GStreamer basically supports two forms of threading. What you can do is process in the context of the push call of a previous component. Here you see one thread: when the source component pushes data towards this sink pad, a so-called chain function is called in this component, in the context of the push call from the upstream component, and that then ripples through the other components. So not every component needs to have its own thread, which of course saves context switches and is more efficient. But you cannot do everything with a single thread. For instance, if you have an audio sink that actually pulls data out at a certain rate, then typically you have to decouple the source and the sink. A standard way of doing that is with a queue component; a small example follows below.
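A minimal sketch of that decoupling — again with standard element names rather than the Synopsys elements: everything upstream of the queue runs in the source's streaming thread, while the queue starts a second thread that feeds the sink.

    #include <gst/gst.h>

    int main (int argc, char *argv[])
    {
      gst_init (&argc, &argv);

      /* filesrc ! mad ! queue ! alsasink
       * The queue is the thread boundary: decoding happens in the thread
       * that pushes data into the queue, while the queue's own thread pops
       * data out and feeds the sink at the sink's pace.                    */
      GError *err = NULL;
      GstElement *pipeline = gst_parse_launch (
          "filesrc location=song.mp3 ! mad ! queue ! alsasink", &err);
      if (pipeline == NULL)
        return 1;

      gst_element_set_state (pipeline, GST_STATE_PLAYING);
      g_usleep (30 * G_USEC_PER_SEC);
      gst_element_set_state (pipeline, GST_STATE_NULL);
      gst_object_unref (pipeline);
      return 0;
    }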
With such a queue in place you can have an active component, for instance an audio sink, that has its own thread and actually pulls the data, and the audio decoding can then be done in the context of pulling the data.

So how did we create the offloading GStreamer elements? Well, it's actually not that complex, at least the concept isn't. Basically, whenever you create a GStreamer component that should be offloaded, the corresponding component on the DSP is created, and whenever you create a link between two components, the link between the components on the DSP is created. Of course, the situation is a little bit more complex, because you need to take care of the inter-processor communication, add a FIFO, et cetera. So I have a few examples here of how we do that.

First of all, the case where you connect two GStreamer elements where the second element is not doing the processing on the application processor but on the DSP. When the element detects that the connection is actually towards a remote component, the implementation of the element creates this additional source module and the FIFO, and when new data enters through this sink pad, the data is pushed into the FIFO. The other way around is similar: if data that is local to the DSP is wanted on the application processor again by the next component, a sink component and a FIFO in the other direction are created and the data is streamed through there. The next case is two components that are both implemented on the DSP. Then on the GStreamer level a link is still created, but no data is streamed over it; instead, a link is created on the DSP. As said, we want to support multiple DSPs, so when both components are offloaded and a link is created between them but they are on different DSPs, again you need a FIFO in between, and that's all automatically instantiated and connected; you don't notice it when creating the GStreamer application.

I mentioned threading, so this is how the threading works out in the offloaded case. Here's a chain with two elements on the host and two offloaded elements. Typically data would be pushed in here. If it wasn't offloaded, that thread, that processing chain, could continue through these components. That might still be possible, but to get the most asynchronous behaviour we think it's a better idea to have the thread stop here: it still pushes the data into the FIFO and then it ends. For getting data back from the DSPs towards the host, we use a second thread, which typically could be either in this component or in that component, and that thread actually gets the data out of the FIFO and again pushes it over the next connection. For the connection in the middle we basically do not do anything special; nothing needs to go through this component, because even the events are translated to events on the MSF level. An end-of-stream event, which normally would ripple through like this, with the offloading goes through the DSP and enters GStreamer land again here.

Mapping states: GStreamer components have states, and in the implementation, depending on the state transitions, we do the creation of the MSF modules and links, et cetera. The first state transition is from NULL to READY. That's the state transition where GStreamer basically says you should allocate all resources needed by the component, so here we create the MSF modules and the FIFOs; a sketch of how that can look in the element's state-change handler follows below.
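This is not the actual plugin code — the GstDspDec type and the msf_* helpers below are hypothetical, and the usual element boilerplate is omitted — it only illustrates hanging the MSF module and FIFO creation off the GStreamer state transitions:

    /* Sketch of a hypothetical offloading element's state-change handler
     * (GStreamer 0.10 style); msf_*() stands in for the real MSF proxy
     * functions, which are not shown in the talk.                        */
    static GstStateChangeReturn
    gst_dspdec_change_state (GstElement *element, GstStateChange transition)
    {
      GstDspDec *self = GST_DSPDEC (element);

      switch (transition) {
        case GST_STATE_CHANGE_NULL_TO_READY:
          /* Allocate all resources: create the module on the DSP and the
           * shared-memory FIFO used for the core crossing.               */
          self->module = msf_create_module (self->module_type, self->dsp_id);
          self->fifo   = msf_create_fifo (self->dsp_id);
          break;
        case GST_STATE_CHANGE_READY_TO_PAUSED:
          /* Peers are known now: create the links / deep tunnels,
           * as described next.                                           */
          break;
        default:
          break;
      }

      return GST_ELEMENT_CLASS (parent_class)->change_state (element, transition);
    }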
Then the next transition is READY to PAUSED. During that state transition we actually know what components are connected; when you just create a single component, in the NULL-to-READY transition you don't know yet what the peer component will be. From READY to PAUSED you do know what the peer component is, and we use that information for creating the connections between the FIFOs and the MSF components. Next is PAUSED to PLAYING. Part of the READY-to-PAUSED transition is also something called preroll: the pipeline is actually already processing data up to the last component, the renderer, which typically has a clock that decides when samples can be rendered, and in the PAUSED state that clock is still pending. From PAUSED to PLAYING the clocks are started and the rendering and the streaming really begin. PLAYING to PAUSED and the other transitions are just the other way around, for tearing down a graph again and freeing all the resources.

This is a quick code snippet, basically describing the same as what I had in the pictures. This is the function for the READY-to-PAUSED state transition: for all the output pin connections here, create the actual connections. It first looks up what the peer of the component is and then looks at some properties of those peer pads. A deep tunnel means the link is on the DSP. If it's not a deep tunnel, then we need a core crossing, so a sink module is created and the pins are connected in the MSF way of working. If it is a deep tunnel, we check whether it's a local or a core-crossing deep tunnel, in the case of multiple DSPs. If it's local, we just connect the pins; otherwise, again a sink component and a FIFO are used.

Some miscellaneous topics. Besides the streaming, which of course is important in a streaming framework like GStreamer, there's some more configuration and control. GStreamer uses GObject properties, and we use them too: for instance, allocating a component to DSP 0 or DSP 1 is just a matter of setting a DSP-ID property on the GStreamer element. Event and end-of-stream handling I think I already mentioned; we need to do that on both the GStreamer and the DSP side. And finally, clock and A/V synchronization. Synchronization is a topic I could fill another hour with. GStreamer has good mechanisms for it, with clocks and timestamps in the data. In a real-life situation it can be more complex: if you have an I2S output or input, that can be slave clocked or it can have a master clock, and typically there are also hardware PLLs here and there that need to be synchronized, et cetera. All of that you have to take care of as well.

Okay, as for the demo, I propose not to run into the break, so if people want to see a demo, just come to me after the presentation or come to the Synopsys booth. I actually have two types of demos that I could show. The one at the booth is an FPGA-based implementation where you can see all this running. We also have a simulator that runs on a PC using virtual prototype technology, which is also quite nice to see; this has actually been the main development vehicle for the audio subsystem hardware and software.

That brings me to the end. Conclusions: I hope I have given you an idea of how we implemented the GStreamer plugin that offloads the processing from the application processor towards the DSP.
We built on an IPC mechanism like remoteproc, and we utilized a streaming framework on the DSP to not offload just a single codec but offload complete subgraphs, for efficiency. This solution is fully GStreamer compatible: you can still use the GStreamer way of working, of creating graphs, et cetera, and if your application, your player, was using GStreamer, then you can still reuse that. One of the goals, of course, was to do this in a way that the benefits of using a DSP — lower power and higher performance — are not lost, and we actually achieved that: the overhead of the host communication et cetera is really minimal.

Which brings me to the questions, and I think I'll start with my own question: when you say the overhead is really minimal, do you have some numbers? Of course I have some numbers. Here's a pie chart you really cannot see, but the smallest parts here — here's one — are the tasks that are involved in communication, so compared to all the computation that needs to be done, it's really negligible. You also see it here in the numbers. For different use cases — plain file playback, file playback with SRS post-processing, and here file playback with a Dolby encode — you see that the audio processing takes up the bulk of the megahertz budget; the peripheral outputs, which we also measured separately, are a small fraction; and the host communication, the inter-processor communication, is really negligible, about a percent, certainly if you have a more complex chain and not just a single codec. Okay, well, that really brings me to the end. There are some references here, and I think we have five minutes for questions, if there are any.

[Audience question about handling video the same way:] In principle yes, but for video there are a few things that are typically different. Video processing, although you can do it on a DSP, is typically — if you go to high-definition resolutions, et cetera — better done with dedicated hardware implementations, and you can still encapsulate those as a GStreamer element. The other difference for video is that the data you're talking about, a video frame, is a couple of megabytes, whereas for audio we're talking about hundreds of kilobytes. Hardware typically likes to have contiguously allocated memory, so for video you need to do some special things; for audio it's all so small that it's easier. GStreamer, especially version 1.0 by the way, has very good support for hardware-accelerated video processing; it has the contiguous memory allocators, et cetera.

[Do you target GStreamer 0.10 or 1.0?] When we started this project, 1.0 was not there yet, so we have been using 0.10, also because most of our customers are still using 0.10. But I see for sure no issues in using this with the 1.0 version, and also no real additional benefits — I don't think the changes were in this area. The trick of making the connections, still leveraging a lot of GStreamer, and then doing the actual streaming on the DSP works just as well with 1.0.

More questions? [I'm not that familiar with this whole subsystem, but where does the buffering occur? Typical stream buffering — would that be up front, in front of all of this?] Basically, if you create a full processing graph, then you have the freedom to put additional buffering in between components; that's typically what you do, add an additional queue. What you could do is, if for instance you're streaming from a network connection,
then you want to have some additional buffering there, et cetera. So it depends a bit on where you need buffering in your system, and you can add it and configure it. Typically you want to buffer more in the compressed audio domain, because there the data sizes are still smaller. You could also do all your buffering just before sending it to the output and rendering, but then typically it's uncompressed, so the data sizes and your buffers need to be a bit bigger. So typically you want to do the buffering as early in your chain as possible.

[And exception handling — for example, if you have an underrun? I kind of saw everything going one way, but what comes back the other way when you have an exception like that?] Well, it depends a bit on whether the exception is recoverable or not, of course. But I mentioned the GStreamer mechanisms of events and messages that you can send to the application, so you can always notify the application if you cannot recover. Furthermore, you have, as part of the stream, events or flags in your buffers that can for instance notify about a discontinuity in the stream, or that some data might be corrupted. Now, some of the decoders that we have are actually robust against data corruption and at least make sure that they either repeat the previous frame, or put in some audio, or maybe do some clever things like stretching a bit; that's basically part of the implementation of the decoders, where that logic is. [And GStreamer supports that kind of event?] So the support for communicating that type of events and information is all in there.

[Does GStreamer typically support external hardware- or software-based decoders, or did you do something special?] It's typically used with internal software decoders, doing the media processing in software on the same core where you run the framework, but nothing prevents you, as I've demonstrated, from actually using it for offloading and still having all the benefits of the standard GStreamer API and framework. So the question is really: is it GStreamer out of the box, with no mods at all? Yeah, pretty much. GStreamer is extensible, so the elements, the plugins that do the offloading, are our special components, but the GStreamer framework itself we did not touch.

Okay, I think we have time for one final question, and then we go to the break. Yes, go ahead. Yeah, it is still standard GStreamer that you talk to, so as long as the components that you want to hook up share the same capabilities and are compatible, then you can do that. [And the output, the I2S or the S/PDIF output?] That's basically — I think you can already see it here — that besides sending data to the DSP, it can also come back to an element, and here again it's a standard GStreamer path; I have a file streamer component here, I can show it to you later during the break if you like. What we typically do — I think I mentioned it, I'm not sure on exactly which slide — is the rendering directly by the DSP as well. Typically the endpoint in a normal chain would be an ALSA sink component; we have a special component that does not go through ALSA but keeps the data on the DSP and uses the output drivers over there. Okay, well, thank you for your attention.