Alright, so this is just to make sure you guys are in the right room. If you don't know by now, the Alliance for Open Media is making a codec called AV1. It's a joint effort by a lot of companies to make something royalty-free. And I realize that we're in Europe and you guys believe you don't have software patents. But in fact, a bunch of European companies like Nokia and Philips and Fraunhofer all file lots of software patents and are very prolific at it. So, sorry, you guys do have software patents.

What I'm not going to try to do is give a complete overview of what is in AV1. I gave one of those at VideoLAN Dev Days. If you weren't there, sorry; you can watch the talk on the web. But you can't possibly do that in 25 minutes. I tried to do it in an hour there and it was still quite rushed. So what I'm mostly going to talk about is what has happened since that last talk.

The first thing is we had a couple of new members, but not anybody really important or anything like that. But the most important question I'm sure you're all asking is: are we done yet? When we had our last board meeting in December, we said, okay, the date to finalize the specification is going to be February 2nd. And that meant there was no problem for me flying here to Europe to give a talk on the third, especially since I had to get on a plane two days ago, and announce that we're in fact done. No, no, we're not done. But we're getting really close.

So what's left? There are basically three main issues. We have to fix some remaining problems with the transforms. There are some final details of the high-level syntax to sort out, and some last-minute changes to motion vector prediction that the hardware companies wanted, and also fixing all of the bugs. And then finally, completing our IPR analysis. The IPR analysis we're actually making pretty good progress on, and we don't anticipate it being too much of a blocker. The bugs aren't that bad either. So that's the full list of bitstream-normative bugs. It fits on a slide. None of them are even assigned to me, which is why I can come here and give you this talk. That's probably not all of the bugs that we have. Well, it's definitely not all of the bugs we have, but some of them are not blocking the bitstream freeze. This is something we hope to get sorted out in the next couple of weeks, as opposed to many more months.

The specification: I will put that URL up again. If you remember this from my talk at VideoLAN Dev Days, it's got a lot more stuff in it now. Please read it, give us feedback. You will be pleasantly surprised how much is there.

So what has actually changed? These are all the very technical details. If they're too technical, I'm sorry. Okay, that's what I want to hear. She says there's no such thing. All right, so let's dive in.

Entropy coding. Last time I talked about how we had smaller multiplies than we had previously. Now they're even smaller. So this is, again, a thing that speeds up hardware and saves area. We replaced our 8-by-15-bit multiplies with 8-by-9-bit multiplies, and how we do that is we just shift things down by 6 bits before we multiply. The problem when you do that is that your cumulative distribution functions are 15 bits and we're shifting them down by 6, and so some of the adjacent values can collapse to the same value, which means the probability goes to 0. And that's bad, because it's really hard to code something with 0 probability.
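To make that collapse concrete, here is a minimal sketch (illustrative values, not libaom code) of what happens when a 15-bit CDF is shifted down to 9 bits: two adjacent entries that land in the same 64-wide bucket become equal, so the symbol between them ends up with zero probability.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* A 4-symbol cumulative distribution in Q15 (32768 = probability 1).
     * Symbol 1 spans [20000, 20030): probability 30/32768. */
    uint16_t cdf[4] = {20000, 20030, 30000, 32768};
    for (int i = 0; i < 4; i++) {
        /* The multiply in the range coder now only sees the top 9 bits. */
        printf("cdf[%d]: %5u >> 6 = %3u\n", i, cdf[i], cdf[i] >> 6);
    }
    /* 20000 >> 6 and 20030 >> 6 are both 312: symbol 1 has collapsed to
     * zero probability, which an arithmetic coder cannot code. */
    return 0;
}
```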
So what we do is, essentially, inside the entropy coder itself, we reserve a little bit of space for every symbol, so that no probability can actually go to 0. This is about as complex to do as our previous method of updating the CDFs and making sure none of the 15-bit probabilities went to 0. But as a bonus, it means we now don't care if any of those 15-bit probabilities go to 0, because we'll just add this little bit of extra space. So we can stop doing that work in the CDF update and do it inside the entropy coding engine itself. That speeds up hardware and costs us basically nothing. We still keep the probabilities in 15 bits, because when we do adaptation they change relatively slowly, so we want that extra precision. It turns out that if we just make the CDFs smaller directly, then we lose a bunch of coding performance, but by shifting them down just in time, we don't lose that compression gain.

We also simplified our backwards adaptation. The way this works is, at the end of one frame, you save off all the probabilities to use for the next frame. But if you have a video with multiple tiles, and you want the tiles to be decodable independently, then what we used to do is average all the probabilities from all the tiles together. The problem is, when you have many, many tiles, you have to buffer all these things and add them all up, and the hardware people said that was too complicated. So now we just pick the tile with the most bytes in it, the most compressed bytes in it, and use those CDFs. Surprisingly, this works basically as well, but it's a lot simpler.

All right, transforms. We added transforms with a 4-to-1 or 1-to-4 ratio of width to height. So now we have 4x16 and 16x4 and 8x32 and 32x8. That's good because we already had prediction block sizes of these sizes; we just didn't have transforms to cover them, but now we do, so we can code those with a single transform. We also added 64-point transforms, so we can go all the way up to 64x64. The one caveat is that everything outside of the upper-left 32x32 block is forced to zero. But generally you only use 64-point transforms on blocks that have very consistent texture, where you don't have a lot of energy in those very high-frequency coefficients. Mostly what this means is that the hardware companies don't have to make much bigger buffers, and we don't have to have all these new coefficient scanning patterns and all these things. So that's why that was done.

So, we had these great transforms from Daala, called Daala TX. They weren't adopted. Sorry, we tried really hard. But we have been fixing a bunch of problems in the other set of transforms that was being proposed. So now things like the order of the row and column transforms are actually consistent for all the rectangular sizes, which wasn't true before. We put VP9's 4-point ADST back, because it has lower latency in hardware than what was being proposed, but now we've found out that it has 64-bit overflows, so somebody's fixing that. The type-IV discrete sine transforms are now consistent between the DCT and the ADST. It turns out that the DCT has a type-IV transform embedded inside of it, but that embedded transform didn't match the one we were using to implement the ADST. And so now we've sorted that out so they're consistent. So you can, for example, reuse your implementations for both.
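If you want to see that embedding concretely, here's a minimal floating-point sketch (naive O(N²) reference transforms, nothing like the real integer implementations): the odd-indexed outputs of an N-point DCT-II are exactly an N/2-point DCT-IV of the anti-symmetric part of the input, and a DST-IV is just a DCT-IV with its input reversed and alternate outputs negated, which is why making these consistent lets you share code between the DCT and the ADST.

```c
#include <math.h>
#include <stdio.h>

/* Naive O(N^2) DCT-II, the "normal" DCT. */
static void dct2(const double *x, double *X, int N) {
    for (int k = 0; k < N; k++) {
        X[k] = 0;
        for (int n = 0; n < N; n++)
            X[k] += x[n] * cos(M_PI * (2 * n + 1) * k / (2.0 * N));
    }
}

/* Naive O(N^2) DCT-IV. */
static void dct4(const double *x, double *X, int N) {
    for (int k = 0; k < N; k++) {
        X[k] = 0;
        for (int n = 0; n < N; n++)
            X[k] += x[n] * cos(M_PI * (2 * n + 1) * (2 * k + 1) / (4.0 * N));
    }
}

int main(void) {
    double x[8] = {3, 1, 4, 1, 5, 9, 2, 6}, X[8], v[4], V[4];
    dct2(x, X, 8);
    /* The anti-symmetric half of the input... */
    for (int n = 0; n < 4; n++) v[n] = x[n] - x[7 - n];
    /* ...through a half-size DCT-IV gives the odd DCT-II outputs. */
    dct4(v, V, 4);
    for (int k = 0; k < 4; k++)
        printf("X[%d] = %9.5f   dct4(v)[%d] = %9.5f\n",
               2 * k + 1, X[2 * k + 1], k, V[k]);
    return 0;
}
```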
The extra scaling for rectangular transforms (basically, there's an extra factor of the square root of 2 required in the implementation) is now done in the same place for all of them. So that helps. And there have been lots of changes to how the scaling and dynamic range and all that stuff work. The main thing that has to be resolved right now is that the overflow handling is unclear. Basically, if you inject arbitrary quantization noise into your coefficients, then you can cause all sorts of arithmetic operations to overflow. And it's unclear how you're supposed to handle that, because the C code does one thing, the SIMD code does another thing, and the spec says something else. So all of that needs to be made consistent. All right, that was transforms.

Coefficient coding. We completely threw away the old stuff and replaced it with something else: this new thing called LV-MAP. Basically, what it does is, up front, you code the position of the last non-zero coefficient, and then you scan the coefficients in multiple passes. The first one sort of says, you know: are you a zero, a plus or minus one, a plus or minus two, or something bigger than two? And that's coded with a single four-valued symbol. Then you code the signs of all the non-zero values in a separate pass. And then, in your third pass, you go back and find all the things that were bigger than two and code those large values, saying exactly which value bigger than two each one is. All of these use contexts based on the stuff that you had already decoded in the previous passes. The main advantage of this is that you can get away with a much smaller number of contexts than we had in VP9. If you remember VP9's coefficient coding, it has some giant five-dimensional array of contexts, and we have a much smaller set for this; I think it is roughly a quarter of the size. So that's a thing that makes hardware companies happy.
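Here's a tiny sketch of that pass structure, just to make it concrete. This is not the real LV-MAP code: the adaptive contexts and the exact coding of the large values are omitted, and the stand-in "coder" just prints what would be coded.

```c
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the arithmetic coder: just log what would be coded. */
static void code_symbol(const char *what, int v) { printf("%-5s %d\n", what, v); }

/* Sketch of the LV-MAP pass structure described above, for one block of
 * n coefficients given in scan order. */
void code_coeffs(const int *coeff, int n) {
    int last = -1;
    for (int i = 0; i < n; i++)
        if (coeff[i]) last = i;
    code_symbol("last", last);  /* position of the last non-zero coefficient */
    if (last < 0) return;
    /* Pass 1: one 4-ary symbol per coefficient: 0, +/-1, +/-2, or >2. */
    for (int i = 0; i <= last; i++) {
        int mag = abs(coeff[i]);
        code_symbol("base", mag > 2 ? 3 : mag);
    }
    /* Pass 2: signs of the non-zero coefficients. */
    for (int i = 0; i <= last; i++)
        if (coeff[i]) code_symbol("sign", coeff[i] < 0);
    /* Pass 3: remaining magnitude for everything that was "bigger than 2". */
    for (int i = 0; i <= last; i++)
        if (abs(coeff[i]) > 2) code_symbol("tail", abs(coeff[i]) - 3);
}

int main(void) {
    int coeff[8] = {5, -1, 0, 2, 0, -7, 0, 0};
    code_coeffs(coeff, 8);
    return 0;
}
```

The point of splitting the passes is that the context for each symbol only depends on values that were already fully decoded in an earlier pass.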
We've also added a new prediction mode called intra block copy. It copies the contents of your current frame into your new prediction block. The location that you copy from is specified by a motion vector, even though there's no motion, because it's in the same frame. There are some restrictions on it to make it work with hardware. The source must be more than two superblocks prior. That essentially allows the hardware decoding pipeline to have some latency in it, so that it has actually finished decoding a block before you need to copy from it. And also, loop filters are disabled. The problem there is that you want to be able to do the copying before you've run the loop filter for the current frame, because you haven't finished decoding the current frame yet, so you couldn't run the loop filter on all of it. But if you then went back and ran the loop filter, it meant that hardware had to write this stuff out to memory twice: once before you loop filtered it, so that you could copy it back, and once after you loop filtered it, so you could use it as prediction for the next frame. So instead we just always turn the loop filter off. That makes this kind of a special-purpose tool that won't be easily usable in all situations, but in the situations where it is usable, it can provide some very interesting gains.

Motion vector coding. Just to recap what I said at VideoLAN Dev Days: this is a super complicated scheme, and I didn't even try to explain it then. And the current status is that it's a super complicated scheme that I'm not even going to try to explain, because we only have 25 minutes, but all the details are now totally different. Basically, what's happened is the hardware companies took a look and said that the old scheme had a lot of dependencies on completely deriving the exact values of all the motion vectors before you could even parse the bitstream for the current motion vector, and so there's been a bunch of work to eliminate those dependencies, so that you can do the entropy decoding before you've completely reconstructed all of your neighboring motion vectors. They're working on some more changes to potentially simplify that further, and that's one of the three issues I highlighted back at the beginning. So that may change a little bit more before we're quite done.

We also added some new tools. One is this MFMV tool, which stands for motion field motion vector... something, something. Basically, the idea is that you take the motion vectors from your reference frames and project them onto your current frame, just by linear extrapolation, and then gather all the candidates that intersect each 8x8 block. What this lets you do is get predictions for things that are fast-moving and very far away, as opposed to just looking at your co-located blocks, which can only capture relatively slow motion. So this is sort of like direct mode, but unlike direct mode, which is entirely designed for very smooth, slow motion, this is designed to capture very quick, otherwise unpredictable motion. That gives something like a 1% coding improvement.

We also changed how the warped motion sample selection works. If you recall, what warped motion does is compute an affine transformation of per-pixel motion vectors, by building a local motion model from the current block's motion vector and a bunch of motion vectors from its neighbors. We added the upper-right block to the list of motion vectors that it uses to build that model, which helps you get a slightly more robust model. And we now remove motion vectors that differ a whole lot from the current block's motion vector, which helps make the fitting more robust.

We also added this extended skip mode. If you recall, skip mode in VP9 just means that you don't code a residual, but you still have to code which reference frames you're using, what prediction mode you're using, which motion vectors you're using, and all this other stuff. So what we do now is, when we have a reference frame right before the current frame and a reference frame right after the current frame, we can enable this extended skip mode. If a block is marked as an extended skip, then we know right away that it's inter-coded; we know that there's no residual, which is what the old skip meant; we force it to be coded in compound mode, which is the VP9 term for bi-prediction, using one forward and one backward reference, the two that are the immediate neighbors; and we just always use the best predicted motion vector for each of those two reference frames. And then we code nothing else for the entire block, which is how skip mode works in other codecs. So we actually have one of those now, and it helps save a bunch of bits.
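Spelled out as code, the inference looks something like this (a minimal sketch; the structure, field names, and the best_pred_mv() helper are all hypothetical, not the real syntax):

```c
/* Hypothetical mode info for one block, just to spell out what an
 * extended-skip block infers rather than codes. */
typedef struct {
    int is_inter;   /* inferred: the block is inter-coded */
    int skip;       /* inferred: no residual (the old VP9-style skip) */
    int compound;   /* inferred: compound mode, i.e. bi-prediction */
    int ref[2];     /* inferred: nearest past and nearest future references */
    int mv[2][2];   /* inferred: best predicted MV (row, col) per reference */
} BlockMode;

/* Stand-in for the motion vector prediction process (not a real API):
 * a real decoder derives this from its spatial/temporal candidate lists. */
static void best_pred_mv(int ref, int mv[2]) {
    (void)ref;
    mv[0] = 0;  /* row */
    mv[1] = 0;  /* col */
}

/* When a block is marked extended-skip, nothing below is read from the
 * bitstream; it is all derived. */
void apply_extended_skip(BlockMode *b, int past_ref, int future_ref) {
    b->is_inter = 1;
    b->skip = 1;
    b->compound = 1;
    b->ref[0] = past_ref;
    b->ref[1] = future_ref;
    best_pred_mv(b->ref[0], b->mv[0]);
    best_pred_mv(b->ref[1], b->mv[1]);
}
```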
All right, there are a bunch of changes to loop filtering. We made deblocking modify one fewer line. Why is that important? It helps us eliminate some of the line buffers in CDEF and loop restoration, the two new loop filters that we added. We also changed the offset of the loop restoration processing blocks, and how the superblock boundaries are handled, to align them with CDEF output. We didn't have to change CDEF at all because, of course, CDEF is perfect. CDEF is a Mozilla tool, if you were wondering. In loop restoration itself, we simplified the self-guided filter. What that means is that the self-guided filter computes a bunch of parameters, and computing those parameters is very expensive, so now we just compute them less often and interpolate the results. That makes it less complex. The upshot of all these changes is that the total number of lines you have to buffer in hardware for all these things is now reduced to 16. It was 30 before all these changes, so it's a pretty significant reduction. That is also, conveniently, the same number of line buffers required by VP9. So despite the fact that we have new loop filters, we don't require any more buffering, and that makes hardware vendors happy.

Alright, five minutes left. Don't worry, we're almost there.

Frame super-resolution. If you know what the academic community calls super-resolution, which is upsampling images by using motion interpolation to recover finer detail, that's not what we're doing at all. What we're actually doing is coding a frame at a reduced resolution, upsampling it with a very simple filter, and then running loop restoration on the full-resolution image. Basically, the loop restoration on the full-resolution image gets rid of the artifacts from the very simple upsampling filter. Only horizontal resolution reduction is allowed, so we can squeeze the image horizontally but not vertically, and that's again to simplify hardware buffering. But what this lets you do is get much more gradual bitrate scaling, so you don't have to do things like cut your entire video resolution in half.

Spatial segmentation. We added a new spatial segmentation mode. It predicts all the segment labels that you have in VP9, but it doesn't do it temporally, it does it spatially. More importantly, it allows you to not code a segment label for blocks that are skipped. What we use this for is being able to change the quantizer on a block-by-block basis, which is important for things like adaptive quantization (i.e. activity masking) or temporal RDO (i.e. MB-tree). So we can now implement those much more efficiently, which will be important for getting good visual quality when we start tuning for that.

There are a bunch of other changes. We changed cross-tile dependencies to allow low-latency encoding and repacketization of tiles into different tile groups. Tile groups are the mechanism we have for putting things into, for example, an RTP packet, and now we can change the sizes of those with some freedom. There's a decoder rate model. One of the difficulties you have when you don't have B-frames, but instead have this concept of hidden frames, is that you no longer have a fixed reordering depth. So if you have hardware, and you say, well, I could have six hidden frames before I decode my current frame, does the hardware now say, well, gee, you have to be able to decode at six times the frame rate to guarantee I can display everything? So you put a rational model around it and figure out something that they can actually implement. There's CICP color space metadata, so you can store things like primaries and transfer functions. And we added support for monochrome video, because it was easy.
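As an illustration of the kind of thing CICP carries (a sketch with assumed field names; the numeric code points themselves come from ITU-T H.273):

```c
/* Sketch of CICP-style color metadata.  The values shown are standard
 * H.273 code points; the struct itself is just for illustration. */
typedef struct {
    int color_primaries;          /* e.g. 1 = BT.709, 9 = BT.2020 */
    int transfer_characteristics; /* e.g. 1 = BT.709, 16 = PQ (SMPTE ST 2084) */
    int matrix_coefficients;      /* e.g. 1 = BT.709, 9 = BT.2020 non-constant */
    int monochrome;               /* 1 = luma only, no chroma planes */
} ColorConfig;

/* A plausible HDR10-style configuration: BT.2020 primaries with the PQ
 * transfer function. */
static const ColorConfig kHdr10 = {9, 16, 9, 0};
```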
Alright, how are we doing? Instead of presenting metrics that we collected ourselves, like I usually do: Moscow State University just this week published a report where they compared x264, x265, VP9, and AV1. You can see the four bars there in the upper right are x264, and then there's a big jump down to the next-generation codecs. Most of those are x265; the one right there is VP9, which is better than everything except placebo mode, and it's a lot faster than placebo mode, by the way. And then there's another big jump down to AV1. This was using a version all the way back from June, so we're a lot better than that now, and that's already doing pretty well. So we expect that to work out well. Are there any questions? I have 36 seconds left... 5 minutes for questions. Yes, in the back.

Thank you for the presentation. My question is: I'm working on FFmpeg bug reports, and apparently, if you take the I-frames out of a VP8 or VP9 stream, the stream never recovers, while with all the MPEG streams I've ever seen and tested, you can take away the I-frames and the stream will recover, and if there's a scene change it will even recover very quickly. Is that something... do you know how that works in AV1?

Right, so the biggest reason for that is this backwards adaptation of probabilities. What happens is that you code a frame, and all of your probabilities are updated as you code that frame, based on the symbols that you code. In most traditional MPEG codecs, when you start the next frame, you reset all the probabilities and start from scratch, whereas in VP9, and also in AV1, you start from the probabilities you ended up with at the end of the previous frame. As a result, if you lose one of those previous frames, then in your future frames everything is wrong, because you're not even decoding the right symbols.

And is that not an issue? I mean, maybe I'm not seeing it, but it looks like an issue to me.

Right, so there are a couple of things. One is, you can turn this off. It's not the normal mode of operation, because it does cost you compression efficiency; I think in our last test it was around 2%, which is fairly significant. But even if you don't turn it off, one of the things that we changed in AV1 is that we can now derive which probabilities you start from from one of your reference frames. So as long as you have all of your reference frames, you can continue to decode. That means, if you're doing, for example, a real-time scenario where you have multiple spatial or temporal layers, and you lose a frame, you can still recover by dropping back to one of the previous frames that you actually received. As long as you have all of the reference frames needed to decode the current frame, you can decode it successfully.

So you always have to keep the reference frames... Let me repeat the question. The question was: doesn't that mean you have to keep the reference frames in memory at all times? The answer is: you always have to keep reference frames to be able to decode future frames, but the limit we currently use is 8 reference frames, and 8 reference frames is enough to do, for example, 3 temporal or 3 spatial layers at the same time, which is at least a reasonable real-time broadcasting scenario.
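A minimal sketch of that recovery mechanism (hypothetical names; the point is only that the entropy coder's starting state comes from a named reference slot rather than implicitly from the previous frame in decode order):

```c
#include <stdint.h>

#define NUM_REF_SLOTS 8        /* AV1 keeps up to 8 reference frames */
#define PRIMARY_REF_NONE (-1)  /* "start from the default probabilities" */

/* Stand-in for all of the adaptive CDF tables. */
typedef struct { uint16_t cdfs[1024]; } CdfTables;

static const CdfTables kDefaultCdfs;  /* spec-defined defaults (stub) */

/* Pick the entropy coder's starting state for a new frame.  Because the
 * state comes from a reference slot the decoder is known to hold, losing
 * some unrelated frame doesn't desynchronize the arithmetic decoder. */
const CdfTables *initial_cdfs(const CdfTables saved[NUM_REF_SLOTS],
                              int primary_ref) {
    if (primary_ref == PRIMARY_REF_NONE) return &kDefaultCdfs;
    return &saved[primary_ref];
}
```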
Another question is coming. Thanks. So, do you see AV1 as a codec coming to a place where it's going to replace H.264 in solutions like WebRTC in the future?

So the main difficulty with that is getting something that runs fast enough, and it's definitely not currently in that position. One of the things that we're going to be working on next is writing a fast encoder for it. I'm at least reasonably hopeful that that will work out well, in the sense that I think we have another Alliance member who has, for example, a real-time implementation that gets around 30% better compression than VP9 with a similar amount of CPU compute. So I think that's a significant enough improvement that it will take off for WebRTC, but it'll probably take us a while to get the software into a state where we can use it for that purpose.

Would you say that the codec is memory-bound or CPU-bound? It's actually both, yeah.

Thanks. So, Diego's question was: how fast is the encoder? I think it is currently somewhere between 2,500 and 3,000 times slower than VP9. We'll make it faster. Yeah, so as I said, our next goal is to work on a faster encoder. We have one that only does 4x4 blocks; soon it'll be able to code coefficients larger than 4x4, but it's so fast that we test it on 4-minute videos, because why not.

Thank you, Tim.