Hello? Okay, I think we're good. So thanks for coming. I'm Rémi Denis-Courmont and, as JB said, I've been involved for way too long in the VideoLAN project. I'm going to talk about, well, mostly the Linux side of things. And since this is a multimedia and music conference, it's going to be mostly about output. Well, that's the official reason; the real reason is that I've mostly been working on outputs. Although, if you remember the sight-reading talk from this morning, I actually did the MIDI support in VLC, and that's more like an input thing, though I haven't done most of the input stuff otherwise. So basically it's about video and audio output, kind of like what JB talked about, but on Linux; this is a Linux conference after all.

Just a warning: this is my personal opinion. It has nothing to do with my employer. I happen to work for, well, a graphics chip manufacturer with a green logo; I'm not going to give any names. And if I speak too fast or do not articulate, please stop me. You can also stop me if you don't understand a word or whatever. I think we have time.

So, the multimedia pipeline. As JB kind of talked about in the previous presentation, it's pretty much the same in every multimedia framework. It's the same in GStreamer, the same in DirectShow, and it's kind of similar on Android. Even MPlayer, while being monolithic, has a similar work split between the different components. It simply comes down to the fact that the specifications we have to implement are the same for everybody, and they kind of split the way everybody splits.

So usually, on the input side, you have a byte stream reader. The nomenclature might differ from project to project, but it's pretty much always the same thing: you have a URL or a file path, and out of that, you have components that implement HTTP, file access, FTP, whatever. Then you have your file format parser. So far, it's pretty much like any software; even OpenOffice works that way. But of course, the format parser is focused on multimedia, so it's mostly about extracting audio and video signals from a file. They are usually multiplexed together in a single file, and you have to extract them, and also extract metadata like resolution, codecs, and the timing information.

You might have packetizers, which, especially for audio... well, if your file format doesn't preserve packet boundaries, then you might need to regenerate them from the bit stream. But usually the file format takes care of that, so you don't need to do this. And then you have the decoders: audio decoders, video decoders, and subtitle decoders, or font renderers, or whatever you want to call them. And then you have what I would call outputs, or filters: deinterlacing, gamma correction, whatever. Blending and overlays is where you put the subtitles back into the video signal, and then you output audio and video.

The pipeline is driven by buffer levels, rate control, drift compensation, lip sync, which is all time-based. And that's basically the main difference between multimedia and other desktop applications: we drive everything on time, and we don't load the whole file. We just play it as time goes, because, I mean, a typical video might be gigabytes big; you just don't want to decode the whole thing into memory. It could be terabytes after decoding anyway.
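To make that time-driven stage split concrete, here is a minimal, self-contained sketch in C. All the names (buffer_t, demux, decode and so on) are invented for illustration; this does not correspond to any real framework's API:

```c
/* Hypothetical sketch of the stage split described above: every buffer
 * carries a presentation timestamp (PTS), and the pipeline is driven by
 * time rather than by loading the whole file at once. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    int64_t pts_us; /* presentation time, in microseconds */
    size_t  size;   /* payload size in bytes (payload omitted here) */
} buffer_t;

static void output(buffer_t *b) { printf("render at %lld us\n", (long long)b->pts_us); }
static void filter(buffer_t *b) { /* deinterlace, gamma, blend subtitles... */ output(b); }
static void decode(buffer_t *b) { b->size *= 100; /* decoded data is much larger */ filter(b); }
static void demux(int64_t pts)  { buffer_t b = { pts, 1316 }; decode(&b); }

int main(void)
{
    /* Play as time goes: one demuxed packet per tick, never the whole file. */
    for (int64_t t = 0; t < 5; t++)
        demux(t * 40000 /* 25 frames per second */);
    return 0;
}
```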
So, focusing, as I said, on audio and video output, and audio first: it turns out that what you have to do is not that complicated, but it's surprising how bad audio output APIs are, including on Linux. One problem with multimedia is that we have what I would call audibly long buffers, and by that I simply mean that the duration of the buffer we typically have is something that you will notice, that a human will notice if you hear it. That's very different from what you would have in games or in real-time communications, where you want to minimize delay because it's interactive. I mean, it wouldn't do if, in your first-person shooter, you pressed shoot and one second later it actually shoots and you hear the shooting sound.

For multimedia, you usually do want to have buffers, because it reduces underruns, especially if you have scheduling latency due to some other application taking some CPU time. It also reduces power consumption, because the larger your audio blocks, your audio periods, are, the fewer interrupts you get from the audio chip, and so the lower your power consumption is. Or CPU processing as well. So unlike games and, well, UI button sounds, we need proper support for handling long latency in our audio pipeline. And that's a problem, because a lot of audio APIs have been driven by either game developers or UI developers, and have not necessarily taken into account the specific requirements of multimedia. And by that I mean it's not just VLC: it applies to GStreamer, to other frameworks that you might find on Linux.

So the requirements we have for buffers are relatively simple. We need to be able to maintain lip synchronization, so we need to keep the audio and the video rendered at the same time, otherwise it's really annoying. For that, we need an estimate from the API of the time difference between the moment we actually send the audio block to the API and the moment it's actually going to be coming out of your speakers. Which, as I was saying, is audibly long: typically we're talking about a few hundred milliseconds, even up to two seconds with some high-end desktop chips.

And you need to be able to control your fill levels. You don't want to have too few PCM samples in the buffer, otherwise you might stutter. But on the other hand, if you are finishing playback, you are at the end of your music or at the end of your video, you want to make sure the last second is going to be played, not dropped just because it happened to still be in the buffer: your buffer duration is, say, one second, then you just close, and, oh well, your last second of audio just gets dropped. It's not such a big problem for long movies, because usually the last second is silent, but it's really annoying for music, for instance, where the last second usually still has sound.

And then you have user interaction requirements. If the user presses stop, or just exits the player, we need to flush, and by that I mean we actually need to drop any pending audio immediately, to stop straight away.
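For the lip-sync estimate and the end-of-stream requirement above, this is roughly what it looks like against the ALSA library API. A minimal sketch with error handling mostly omitted; report_latency and finish_playback are invented names:

```c
/* snd_pcm_delay() estimates how long until a sample written now reaches
 * the speakers (for lip sync); snd_pcm_drain() makes sure the tail of the
 * stream is played rather than dropped on close. */
#include <alsa/asoundlib.h>

void report_latency(snd_pcm_t *pcm, unsigned rate)
{
    snd_pcm_sframes_t delay; /* frames queued between us and the speaker */

    if (snd_pcm_delay(pcm, &delay) == 0)
        printf("estimated output latency: %ld ms\n",
               delay * 1000 / (long)rate);
}

void finish_playback(snd_pcm_t *pcm)
{
    snd_pcm_drain(pcm); /* block until the last queued samples are heard */
    snd_pcm_close(pcm);
}
```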
Similarly, if the user presses the pause and resume button, even though you have some samples in the buffer, you want to be able to stop playback as soon as possible, with maybe a few milliseconds of delay, which is a lot shorter than what your buffer length actually is. It just wouldn't do to wait for the buffer to empty itself normally, because if you do that, the user presses pause, and then half a second or one second of audio goes on before it actually pauses; and when they press resume, it waits for one second before it actually resumes. Which is a really bad experience.

And of course, you also want volume and mute control to be interactive, because otherwise it's really bad for user feedback. If you increase the volume and it takes one second before it actually increases, it's kind of like adjusting your shower between hot and cold: if it takes too long, it always ends up too hot. With audio volume, it ends up too loud, stays too loud for one second, and then it's gone. But more specifically with volume control, what we actually need is control of the stream that we are playing, because typically, especially on a computer but also on a mobile device, you have multiple audio sources that might be mixed. On a mobile device it might be a ringtone or an SMS tone; on a computer it will be an email notification, or it might be a game, or whatever. We don't want to interfere with the volume level of those applications, and that's something that doesn't always work.

Then you have things that I would argue are obvious, but it seems that for some developers they are not. We need to be able to enumerate devices, and we need to have hot plug. Back in the days when computer audio started, maybe 20 years ago, you had some SoundBlaster card in your desktop PC; you just had one, you set it up at boot, and if you changed something, you rebooted the computer. But nowadays you have headsets, wireless headsets, USB sticks; you might have an HDMI cable which you plug in and out and which is a new device, in a way. So you need hot plug: by that I mean the application needs to be able to get events when a device is coming and when a device is going. And of course, you need to be able to negotiate your buffering parameters and your format parameters: are you outputting 16-bit integers, which sample rates are you using, which channel layout are you using (stereo, mono, 5.1, 7.1 and so on).

In practice, there are a lot of problems. There are a number of audio output APIs that confuse total latency, which is the real latency you experience between the time you actually submit an audio block and the time it actually starts to get rendered, with the actual size of the buffer, i.e. the addressable buffer. There is usually some part of the buffer that is no longer addressable, because it's already somewhere in the circuitry, in the electronics, and you can't just undo it. So that's been a problem, and the reason for it is just that historically this was not an issue: there was no such thing as a non-addressable buffer. But hardware has moved on, and now this is a problem. You might have no support for pause and resume, so you end up with a delay when you pause and resume. You might not have flush, which means that your stop is going to take a while, and that's really, well, that's annoying.
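The flush and pause requirements map onto ALSA roughly like this. A minimal sketch with invented function names; note that native pause support is optional, which is exactly the problem just described:

```c
/* snd_pcm_drop() flushes pending audio immediately (user pressed Stop);
 * snd_pcm_pause() pauses in place, but only if the hardware supports it. */
#include <alsa/asoundlib.h>

void user_stopped(snd_pcm_t *pcm)
{
    snd_pcm_drop(pcm); /* discard queued samples, stop straight away */
}

int user_paused(snd_pcm_t *pcm, snd_pcm_hw_params_t *hw, int pause)
{
    if (snd_pcm_hw_params_can_pause(hw))
        return snd_pcm_pause(pcm, pause); /* near-instant pause/resume */

    /* Fallback: no native pause, so flush and let the input side remember
     * the position to resume from, which is what players end up doing. */
    return snd_pcm_drop(pcm);
}
```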
Volume controls might be for the whole device, which is what I was saying already. Other issues: configuration might not work properly. There are a lot of APIs that won't tell you which channels are available, or where you can't state which channels you have; you can only say, I have six channels, or I have three channels. Now if you have three channels, is it 3.0 or is it 2.1? Well, you don't know; without an explicit layout, you can't say. And device management is also often broken: no events, or the list of devices doesn't make any sense.

Is anyone using JACK? Raise your hand. Okay, that's not many; I would have thought such a conference would have more, but anyway. So JACK is nice, but it's really specific: it's really targeted at low-latency playback. It has manual routing, so you have a UI for JACK where you would say, okay, this software, this VLC media player, is routed this way through these filters, and then it goes to this output. And it always uses single-precision floating point output, so no digital pass-through like S/PDIF. What it effectively does is work around all of the requirements; it doesn't actually avoid them altogether. Since it's all low latency, you don't have a latency problem, which is the main problem, and you also don't have a device enumeration problem, because everything is routed manually, outside the media player. But the problem, of course, is that JACK is not really adequate for general use; it's too complicated.

So what a lot of people use on Linux is ALSA, of course, which is a low-level API and also a mid-layer, middleware kind of API, because you have both ALSA at the kernel level and then the ALSA library, which provides both direct access to the kernel API and a bunch of extra convenience functions on top of it. You'd think the guys running it would have a clue by now; it's been in Linux for quite a long time, it came in 2.5 and was already being developed as a patch in 2.4. But there are still a bunch of problems. Hardware capabilities especially are annoying, because, being a device driver API originally, it basically tells you what your audio chip can do, but it doesn't tell you what it can actually do. If you have a desktop, for instance, you typically have a 5.1 or 7.1 audio card, and if you look at the back of your computer you'll have the four or five connectors; but just because you have five jacks doesn't mean you have five sets of speakers. Typically you only have stereo, and there's no way to know that from ALSA. It tells you, yeah, sure, I can play 6 channels, I can play 8 channels, except you can't, because only the first two are going to make any sound, and you don't know that. And there's this "plug" plug-in, which does conversion on the fly for stuff that the hardware can't do. Now, if you ask it, can I do 6 channels or 8 channels, it says sure, and it just drops everything that your hardware doesn't do. It doesn't tell you that it drops; there's no way to know that it's going to drop. So it's not very usable.

So recently I did support for channel maps, which is the way you can explicitly negotiate which channels you have: front left, front right, center, LFE, rear left, rear right, whatever. I guess that comes from HDMI requirements. Unfortunately, it's a recent addition and not all drivers support it yet. And again, just like for the channel count, it tells you what the hardware has; it doesn't tell you what is actually wired. So if you have a 7.1 card, it's going to give you all 8 channels, but it's not going to tell you which ones are actually wired to anything.
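For reference, this is roughly how the channel map negotiation just mentioned looks with a recent alsa-lib. A sketch with minimal error handling; print_channel_maps is an invented name:

```c
/* Query which channel layouts the PCM claims to support and print the
 * position of every channel. As said above, many drivers still return
 * nothing here, so the caller must be ready to fall back to stereo. */
#include <alsa/asoundlib.h>

void print_channel_maps(snd_pcm_t *pcm)
{
    snd_pcm_chmap_query_t **maps = snd_pcm_query_chmaps(pcm);

    if (maps == NULL) { /* driver does not implement channel maps */
        printf("no channel map info, assuming stereo\n");
        return;
    }

    for (snd_pcm_chmap_query_t **p = maps; *p != NULL; p++) {
        const snd_pcm_chmap_t *map = &(*p)->map;

        printf("%u channels:", map->channels);
        for (unsigned i = 0; i < map->channels; i++) /* FL, FR, LFE, ... */
            printf(" %s", snd_pcm_chmap_name(map->pos[i]));
        printf("\n");
    }
    snd_pcm_free_chmaps(maps);
}
```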
Even though hardware nowadays usually has jack detection, there's no easy way to use it from ALSA. So in practice, all software with ALSA output just defaults to stereo, and you have to go to some application-specific setting somewhere to change that. In VLC, in the ALSA plugin configuration, I have stereo, I have 5.1, I have 7.1, and every other application has to reinvent that. There's a similar issue for digital outputs: ALSA can tell you that your chip supports S/PDIF, but it's not going to tell you whether you actually have a digital output connected, so you have to disable it by default and wait for the user to enable it explicitly.

And perhaps what annoys me most is that there's just no stream volume, and even the actual volume control for the whole hardware is a complete mess in ALSA, because they just don't really abstract anything. They just give you all the different controls that your low-level hardware has, without any kind of usable abstraction that software could use, which would be basically a main volume and then a stream volume. What also annoys me is that device management is kind of missing: you can enumerate devices, but you cannot get events when a device comes or goes. In theory you could do that, at least for devices that have a device driver, except it doesn't really work, because when udev tells you that you have a new audio device, it turns out there's some post-processing that ALSA needs to do before the device actually works, and there's no way to know when that is done. So if you just try to use the device right away, it just fails. Another issue with device management is that what ALSA calls a device is actually a speaker configuration. So for one audio card, you get this audio card with stereo, this audio card with 5.1, and so on. What you would want to have is: okay, I have the internal audio output, I have the HDMI output, I have my headset. Unfortunately, that's not what ALSA provides. So in practice, ALSA is not really good for high-level application usage.

There's also OSS, which still exists; there are some people, who I think kind of lack clue, who seem to think the Open Sound System is a good idea. OSS is a very questionable API: it's based on ioctl, there's no way to abstract anything, and everything has to be done in the kernel, including format conversions. And floating point, I mean, doing floating point in the kernel is a bad idea; there's typically no sane support for that. To be fair, most of the outstanding issues that version 3 had, the version that was last in the kernel, were addressed in version 4, I think. What is good news is that OSS seems to be dead, since there have not been any real updates for the last five years, almost. So it looks like we might be able to get rid of it, except on FreeBSD, where it is the official API.

sndio is the OSS replacement from OpenBSD. It's really a bad case of Not-Invented-Here syndrome: it makes each and every possible mistake you can make when designing an audio API, all of the ones I gave earlier except for one. It has per-stream audio volume, but it doesn't have buffer control, it doesn't have channel negotiation, it doesn't have pause, it doesn't have anything. It's like the OpenBSD guys wanted to remove OSS, and they replaced it with something that is effectively worse in almost every possible respect. It's also been used on Linux by some kind of MPD competitor, which decided, for some idiotic reason, to use sndio; don't ask me why.
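Going back to the ALSA device management problem for a moment, this is roughly what enumeration looks like with snd_device_name_hint(). Note that what you get back are PCM configurations rather than physical outputs, and that it's a one-shot snapshot with no change events, exactly as complained about above. A sketch; list_pcm_outputs is an invented name:

```c
/* Enumerate ALSA "devices": the hints are PCM configurations such as
 * "front", "surround51" or "hdmi", not actual outputs like "headset". */
#include <alsa/asoundlib.h>
#include <stdlib.h>
#include <string.h>

int list_pcm_outputs(void)
{
    void **hints;

    if (snd_device_name_hint(-1, "pcm", &hints) < 0)
        return -1;

    for (void **h = hints; *h != NULL; h++) {
        char *name = snd_device_name_get_hint(*h, "NAME");
        char *desc = snd_device_name_get_hint(*h, "DESC");
        char *io   = snd_device_name_get_hint(*h, "IOID");

        if (io == NULL || strcmp(io, "Output") == 0) /* NULL = both ways */
            printf("%s\n  %s\n", name, desc ? desc : "");
        free(name);
        free(desc);
        free(io);
    }
    return snd_device_name_free_hint(hints);
}
```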
Luckily, that project is also mostly dead, but there are people asking for this.

So, PulseAudio. That's the more modern thing; it's not so recent anymore, but it's actually quite good, and it's quite well documented. If you've read the ALSA documentation, it will have things like "foo_get_bar: this gets the bar of your foo", which is not very helpful; that's the kind of documentation ALSA has. Unfortunately, when Lennart gave up on that project and moved on to systemd, he didn't, in my opinion, hand over the maintenance very well, and so there have been a bunch of maintenance problems and bugs piling up. But overall, basically, unless you have a specific use case like low latency, where you want to use JACK, I think PulseAudio is the only option.

So, video. As was mentioned earlier, we need YUV support; we need sub-sampling, so 4:2:0, 4:2:2, where you have less information on color than on brightness. What's coming right now, this year mostly, is 10-bit support, where you have 10 bits per component rather than just 8, which means you get about one billion colors rather than 16 million or so. We need planar picture formats. We need scaling in hardware, and we need blending in hardware, because it's kind of silly to do that on the CPU. And then the goodies are filtering, deinterlacing, gamma correction.

On X11, there are different ways to realistically do video rendering. You can just use plain X, but then you have to do scaling on the CPU and color conversion on the CPU. Then there are XVideo, GLX, XRender and VDPAU; I'm going to go through them later anyway. GLX is basically OpenGL, XRender does scaling, and the last ones are accelerated. Wayland is the supposed X11 replacement. It doesn't really have a replacement for XVideo and VDPAU, but other than that, you find mostly equivalent things: it has EGL instead of GLX, and it has a scaler which is roughly equivalent to XRender.

The XVideo extension is how media players have mostly been rendering video for the last 10 or 15 years. It was meant for hardware, but it was designed at a time when graphics cards had dedicated overlays for video rendering, and they don't do that anymore. Unfortunately, because hardware was done that way, it didn't support compositing and blending, so you couldn't add subtitles, and the API was never fixed to address the fact that we don't have these hardware limitations anymore. Cropping is not very consistent across drivers. Nowadays it's just about compatibility with the API that the X server provides, so I think it's long overdue for us to drop support for it.

XRender, not "render", sorry about the typo. Well, it supports only RGB, but other than that, it's usable, except sometimes it's really slow, depending on your driver. DRM is a low-level API provided by the kernel directly. Unfortunately, it's not provided by the proprietary drivers, like the one made by my employer, and it's mostly intended as an interface for middleware APIs like GL and video acceleration. It's not meant to be used by applications, in the sense that VLC or GStreamer would be applications.

And last are VDPAU and VA-API. They were originally mostly meant as hardware decoding acceleration APIs, not so much rendering APIs, but that's what we have. Unfortunately, there's no agreement on which one to use: AMD and NVIDIA are using VDPAU, which is originally an NVIDIA API, while Intel is pushing its own VA-API, which is more popular among free software activists because it had open source code from the start. XvBA was the older AMD API; nobody uses it anymore.
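Since, as the next part explains, everybody ends up rendering through OpenGL in practice, here is a minimal sketch of what the YUV-to-RGB conversion looks like as a fragment shader, with one texture per plane (hence the planar formats mentioned above). This is an illustration assuming BT.601 coefficients, not VLC's actual shader:

```c
/* GLSL fragment shader embedded in C: sample the Y, U and V planes and
 * convert to RGB. The 4:2:0 chroma sub-sampling is handled for free by
 * the texture sampler, since texture coordinates are normalized. */
static const char *yuv_to_rgb_fragment_shader =
    "uniform sampler2D planeY, planeU, planeV;\n"
    "varying vec2 texCoord;\n"
    "void main() {\n"
    "    float y = texture2D(planeY, texCoord).r;\n"
    "    float u = texture2D(planeU, texCoord).r - 0.5;\n"
    "    float v = texture2D(planeV, texCoord).r - 0.5;\n"
    "    y = 1.164 * (y - 0.0625);\n" /* expand the 16-235 luma range */
    "    gl_FragColor = vec4(y + 1.596 * v,\n"
    "                        y - 0.391 * u - 0.813 * v,\n"
    "                        y + 2.018 * u, 1.0);\n"
    "}\n";
```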
The problem now, well, those APIs work fine, except for the fact that they are not vendor-neutral, and they are currently lacking 10-bit support; I guess this is coming real soon, but I don't know yet. OpenGL is basically what everybody is using nowadays. The problem is that there is a hell of a lot of different versions: you have 1.0, 1.1, 2.0, 2.1, 3.0, and now 4, and 5 is coming next year, with all kinds of extensions. So, dealing with all the different hardware limitations, the code gets really complicated. But in practice, on Linux, this is the best option we have nowadays, and it's interoperable with VDPAU and VA-API, so if you want hardware acceleration in front of it, you can still use OpenGL at the back. So in practice, you are just using either VDPAU or OpenGL, but then you have other bugs. That's it. Everybody sleeping?

Thank you very much for that excellent...