Our next speaker is a dear friend of mine, Jean-Baptiste Kempf. He will talk to us about dav1d, an AV1 decoder. Hello. Do you hear me? Yeah. OK. I wonder why half of the room is here, because you probably know more about dav1d than me. But let's try anyway. So yeah, oops. OK, so yeah, that doesn't work. Yeah, I'm using KDE with Wayland and sometimes I crash it with LibreOffice. Yes? Why does the clicker not work? Yeah, it works. OK, so I'm the president of VideoLAN and I work on various open source projects, including VLC, x264, dav1d, FFmpeg, and others, where I maintain some of the DVD stack and a bit of the Blu-ray stack. I'm annoying to many of those projects when they screw up. So I'm probably not the biggest developer there, but I'm doing everything else. No? Yes? So, AV1. Is AV1 just VP9++? VP9 is a semi-failure. It's a good codec, but no one uses it. When you ask YouTube, they don't even have many people actually submitting video in VP9. And there are many reasons for that. Mostly because there was no spec for a long time. There was a codec, but no specs. There was absolutely no ecosystem, except a few open source tools. Have you ever watched an anime in VP9? Maybe, right? One? OK, amazing. You're the only one, right? You're the only person. But you've never been on The Pirate Bay, right? And seen a VP9-encoded anime, right? So the format is a failure. One of the reasons also is that H.264 was so good and so well integrated that it just worked. So except YouTube and Netflix, no one actually used VP9 at scale. So for AV1, they decided to not just do a VP10, but to do something that was actually open, with an actual ecosystem. And I think that's more or less what they managed to do. It's interesting that there are basically three companies that actually worked on AV1: Mozilla and Cisco, plus Google. And there is an actual Alliance for Open Media, which is open, and where people actually contribute. Well, we cannot say that about VP9.
So AV1 has a lot of amazing results. We are talking 20% to 25% better than VP9, and around the same over HEVC. And you can argue about so many test cases, but those are the rough numbers that you should remember: 20% better than HEVC is the good number, 25 over VP9. And there is a lot happening in the ecosystem. There are at least three open source encoders. One is going to be presented just after my talk. There is the libaom encoder, which is the reference one, which is extremely, extremely, extremely slow and a difficult code base. There is a new one that is happening called SVT-AV1, which is basically done by Intel. It's difficult to talk about that one, because six months ago it was nowhere. It was announced at IBC six months ago and they're advancing fast. It's actually usable to produce some videos. But a lot of non-open-source people are working on AV1 too. There is EVE, which is a very good AV1 encoder, extremely slow also. Ateme, Harmonic and Bitmovin are also doing a lot of things around AV1. There are actual hardware encoders, like the one NGCodec is working on, and some other, mostly Chinese, companies working on FPGAs or ASICs for AV1. There have been actual deployments of AV1 almost since day one: YouTube, Netflix, Facebook, and quite a few cloud vendors who sell video encoding have an option to encode to AV1. So people care. And you know what, there are specs, and there are other specs to put that inside MP4, MKV, TS and all the stuff like that. Amazing, right? And this year is the year of AV1, because hardware is coming. Intel has more or less announced that they were doing it this year with their Gen12 chips. Nvidia has also more or less said it was for the next chip, so probably around when they announce products, which is GTC usually. AMD also. And the Samsung TVs at CES had AV1 decoding; Amlogic, Broadcom and a few others I forgot showed chips, actually usable, in production, that can decode AV1.
And all of those can decode both 8-bit and 10-bit, which is quite important, because you remember that for H.264 lots of people had only 8-bit decoders and not 10-bit. And 2020 is also the year of the competition. You probably heard about VVC, which is coming in July; EVC, which is coming between April and July, which is basically a version of VVC with fewer patents. There is something called MPEG-5 LCEVC, something like that, which is actually not a codec, but just pre and post filters. And the AOM community is already talking about research on AV2. So things are going fast. So is that actual competition? I don't think so. VVC is amazing in terms of technical quality, but most of the improvements are based on HEVC, and HEVC has three or four patent pools, so you can expect that, since H.264 had two and HEVC three or four, VVC is going to have five or six or seven. Which means no one can actually deploy it, because that many patent pools is insane. And there are so many big companies who are now just turning into patent trolls, like Nokia, who are outside of the patent pools anyway. And so the question is, are the improvements good enough to justify the cost? HEVC is not deployed anywhere other than broadcast; everything that is online basically skipped it. So I guess it's going to be the same for VVC. EVC is like, oh, we're not really so patented, but then you remove some of the gains. And if you remove so many of the gains, you're going to be at the level of AV1, and then why not choose AV1, right? And the other stuff are actually not codecs and could be applied on top of open source codecs. So I think that even if competition is coming, this is their last big shot, and they might have difficulties. So, dav1d. dav1d is an AV1 decoder, as the name says.
The idea was that we actually needed a good and fast software decoder, because a lot of people are not going to have hardware decoders, and until people can decode, you're screwed, because your codec cannot be played anywhere. And the problem, and everything you're going to see is true for AV1, for VVC and all the new codecs, is that they are very complex. A lot of code is required to write an AV1 decoder, because we tried to take every small gain. So there are lots of tools, and you take 1% here, 1% here, 1% here, but that makes a huge code base compared to H.264 or VP8 or VP9. So the idea was: we need a very good software implementation that is extremely fast, where every cycle counts, because if we are going to deploy that, it's going to be to billions of people who will still have the current, actual machines they have today, and not new machines. And if we don't have that, then AV1 will fail. The idea was to use basically the people from VLC, FFmpeg and x264, who actually know how to write C, know how to write portable and cross-platform tools correctly, and don't use CMake or some weird configure stuff like libvpx and libaom, which is basically impossible to support on so many platforms. And one of the goals was to have a small binary size, because for YouTube or Facebook, when they ship the decoder in their Android apps, they actually care about the binary size. That was, for example, the mistake made in FFVP9, the FFmpeg VP9 decoder that was done before by basically the same team that did dav1d: they didn't care that much about the binary size, and that was an issue. So we launched it, now almost one year and a half ago, in October 2018, and we had a release quite soon after, and it's been improving quite a bit. Just a bit of history.
The announcement was in October, and already three months after we had the first release, which was already four times faster than the reference, the only other decoder at that time. That release was of course focused on x86 64-bit. Then, after three months, we did another release which was focused on ARM: we were already twice as fast as libaom on ARM64. And in the same way, with every release we focused on less important platforms, like ARM32, and then SSSE3, and then even SSE2, and so on and so on. So we have one release every two or three months, which is quite nice. Performance is amazing, right? We're talking about three to five times faster than the reference decoder. The reference decoder has assembly in it, right? It's not like we're comparing C to non-C versions; it's assembly against assembly. When we started, Ronald, who wrote a lot of the code of dav1d, said, yeah, we might be two and a half times faster. No, we are a lot faster. And we're even faster on SSE2, where we did not write as much assembly as we did for the other platforms. Then ARM. So on ARM, you can see that I'm comparing to the new, oh yeah, wow, I've got a laser. The new one, which is libgav1, which is another decoder written by Chrome, and totally not in a not-invented-here-syndrome fashion, because they really wanted to have one that would be faster than dav1d on ARM. So that's the blue one. And as you can see, well, dav1d is quite a bit faster already, and they're improving, but they're not getting close to us. So we're talking two and a half to four times faster than the other decoders. There's a question that was asked quite a bit in the past, which is: what is the complexity of AV1 decoding? So here you can see in yellow FFH264, the decoder of H.264 inside FFmpeg, then you have in red FFVP9, and then in green you've got HEVC, and dav1d.
What you can see, more or less, is that of course VP9 and H.264 are way easier to decode, but that when you spend enough time on dav1d, on AV1, you can make decoders of around the same complexity as FFH264. Then you're going to say, yeah, but you didn't spend as much time on HEVC. Sure, we could maybe make the HEVC decoder a bit faster, but not that much faster. So AV1 is not that complex to decode in terms of CPU, but as there are many tools, it takes a long time to code. OK, is there some weird stuff in dav1d? Yes, there are two things, oops. There are two things that are weird for a decoder. The first one is that it's actually a dual-pass decoder, not encoder, you see, I just made the mistake. It's quite hard to do that. The first pass basically analyzes a few things to be able to schedule afterwards, and also does a lot of the parsing, because there is a lot to parse at the entry point of an AV1 stream, OBUs and other stuff, which were probably not the best-designed things ever, but they were in a rush. And the second pass decodes. And, what is more important for most people, we have a dual threading model. We have frame threading, so you start decoding one frame before the previous one is finished. But at the same time we have tile threading, because most of the videos you're going to see in AV1 are encoded with tiles, so you can start decoding the first tile row, the second row, the third row and so on. So when you try to get the best performance out of dav1d, you need to set both how many tile threads and how many frame threads you're going to have. And the thing is, we might add more threads for the filters. So we need to do something about that, to have everything automatic and working at its best. And we would like not to use machine learning to decide that, even though it's completely in fashion. The rest looks like a normal decoder, like you have in libavcodec, except a bit bigger.
So the question is, why is dav1d faster than the competition? And there are three main reasons. The first one is the single-threaded C version. The C is quite well optimized and well written, because even the C version running in a single thread is fast. And more is coming, because there is one big part of dav1d that is not yet optimized in the C version, and that is coming soon. In two weeks? Yeah, sure. In two weeks. The threading is quite amazing. So, is this the number of threads? Yes, this goes up to 2,000 threads. That is a graph that was done by some of the Mozilla team. And you see that dav1d actually scales with threading quite a bit, and if you look at libaom or libgav1, they just cap around four or eight threads and then they don't improve, while dav1d can still improve, which means the threading is actually good. Oops, of course. And we actually write low-level code. That means C, while libaom is in C++, so there is no C++ overhead. We write handwritten assembly, not intrinsics. Yeah, but intrinsics are easier. Yes, they are easier. But you lose between 10 to 15%; that's what we've seen lately in the various threads on the FFmpeg mailing lists: intrinsics are slower, almost always. So no intrinsics, handwritten assembly. So basically, dav1d: the C version is faster in a single thread, we scale better with threads, and we write lower-level code, which means that of course no one is going to beat dav1d, ever. So in AV1, in dav1d, there are eight stages that we basically managed to write in assembly. The first one is the MSAC, which is basically the entropy decoder. There is the inverse transform, the motion compensation, the intra prediction, and then the four after that are the loop filters: the deblocking loop filter, the loop restoration, the famous CDEF, and the film grain. Film grain is quite debatable, because a lot of people don't like it.
But when you now run dav1d and you profile it, you realize that the parts that are not in assembly are around 25% of the runtime, and that's mostly in what we call the MV handling and the decode coefficients. So here that's for AVX2, SSSE3, SSE2, ARM64 and ARM32. You can see that the optimizations done for AVX2 are basically complete, and there is not much to gain from writing better optimizations; there are just a few tools, like in 4:4:4, or some intra predictions, that are not done. SSSE3 is probably one of the most difficult assembly to write, because we need to support both 32-bit, which has absolutely no registers, and 64-bit, and you need to care about the Windows calling convention and the Linux calling convention, and of course Mac, which is another mess. So that's quite a bit difficult, but that is mostly done. There is some film grain that is done but not merged yet; it will be in the next release. And as you can see, we did some parts in SSE2, because that was easier: mostly the entropy decoder, for which there is no AVX2 version anyway, some of the motion compensation, one of the loop restorations and the CDEF. On ARM64, you can see that mostly everything is done except the film grain, and ARM32 still has quite a bit of work to be done: the entropy coding, the inverse transform, which takes long to write for AV1 because it's quite large, and the intra predictions, where DC, H and V are done but the rest are not. Which brings me to this. Here you can see, whoops. You know x264: it's an encoder, it's extremely fast, blah, blah, blah. We wrote a lot of assembly, and when you look at the graph of x264, you see that around 25% of the code base is assembly and the rest is C. x264 is 68,000 lines of C and around 37,000 lines of assembly. And when you take the whole libavcodec, which is half a million lines of C, you see that it has 80,000 lines of assembly. But dav1d is weird.
It's only 25,000 lines of C, but it's already 64,000 lines of assembly. That's a lot more than x264, and almost as much as what you have in the whole libavcodec. And there is the 10-bit ARM and x86 assembly coming soon, which is going to make this single decoder contain more assembly than the whole libavcodec. We worked on something weird last summer. If you had listened to me one year ago, I would have said, yeah, everything related to GPGPU is idiotic, it's too slow, there is too much latency. I've said that quite a few times, and other people in the community said it too, right? It's very difficult, because you have the latency to upload the textures and get them back, and it's quite difficult. A lot of tools are CUDA-based, which is not open source at all, because NVIDIA doesn't like open source. But we said, what if we try anyway? In dav1d, the film grain is already available as a GLSL shader. You have a C version that is optimized, but what we advise people is to actually do that in the player, because the film grain comes after your decoding is done. But the question was, can we do more? Can we do CDEF, loop restoration and the other loop filters? And it's very difficult, because you cannot really know in advance if it's going to be faster. But do you actually care about being faster or not? What you mostly care about is the power consumption, right? How much battery drain is it going to be on your phone, on your mid-range phone? Because as soon as you've got 60 FPS, you don't care about doing 70 FPS. It's only gamers who care about 144 FPS. For video, you don't care about that. What you care about is not draining your battery when you're watching your video on Instagram or Snapchat or whatever you're watching. So we had a GSoC student who did that during the summer. The guy had no idea about most of AV1.
And so he did basically the CDEF and the loop restorations, both SGR and Wiener, in Vulkan compute shaders, tested on an Android Huawei phone. I think. Yeah, a Huawei P20. And what we saw is that we didn't get any speed increase. However, for basically the same decoding time, so we had VLC running and playing the file on a loop, we got 20% less battery drain, just by using GPU compute shaders. Which was not expected. So what is the future? The next work on dav1d is going to be, of course, 10-bit. And then deciding what we can do about GPGPU and how we can move that to the next level. Thank you. And I'm taking questions. Am I allowed? Yeah. How much time do we have? We have a lot of questions. I'm not sure we'll be able to answer all of them. But let's start. How does the dual-pass requirement fit with the current low-latency trend in the streaming world? I don't think it matters much, the way it's done. It could be problematic in some cases, but I don't think it matters the way it's implemented. Maybe you should just read the code also, but it should be OK. Are the Sisvel patent pool's AV1 patents valid? I would say no. There are surely patents that exist that read on AV1, because there is a patent on absolutely everything. I think they are very, very, very small things, so they don't matter at all. And I think it's just the usual FUD bullshit that they're doing. All of this is for 8-bit, which is nice and fast; what about 10-bit and HDR? Yeah, it's coming. Probably first it will be ARM64 and then x86. Is it possible to use AV1 as an intra codec? Yes. Is it a good idea? I'm not sure. Why not? I think things like FFV1 are easier for that. I mean, I don't think AV1 was designed for that, but there are people around who might know better. AVIF? Yeah, AVIF for images, but... What are the major companies missing from AOM, and how do we push them to AV1?
I would say that some of the people who are in AOM, like Samsung and Apple, are important, and we don't know exactly what they're doing, because those are big companies. Who is currently using dav1d? Oh, that's a good question. So every version of Chrome, every version of Firefox that is shipped, every version of VLC, FFmpeg, and most players based on FFmpeg use dav1d. So basically everyone that is shipping AV1 is shipping dav1d, except Android. What kind of operations do you use in SSSE3 that are not in SSE2? Just thinking... Ask that guy. If you want more details, ask him. Can dav1d use GPU acceleration? I think you answered that. Yeah, no, it's not OpenCL, not CUDA, this is compute shaders. We're using Vulkan compute shaders. When do you think we can expect wide AV1 support in cheap low-end devices? End of the year. In September you will have cheap, or cheaper, Android devices, like $200 to $300, that can decode AV1, 8-bit, 10-bit, 1080p. Well done. Thank you. Thanks.