And our next talk is related to the previous one; this one is about the encoder. Please welcome Luca.

So hi, everybody. We just had a talk about decoding, and now I'm presenting the encoder. I'm a contributor to both dav1d and rav1e, among other open source software. Here are the contacts if you want to ask me a question afterwards. I will talk about AV1, and a lot about rav1e; I will probably mention dav1d a couple of times. I won't be too preachy. I will also talk a little about memory and performance profiling on Linux; if you saw my talk this morning, you probably already know what I'm going to say. This won't go into many details, because I don't have enough time: it's mostly about roadmaps. So interrupt me any time, ask me questions, and I will try to answer. I hope you will have fun.

First: rav1e, an AV1 encoder, is written in Rust. Strange, we know. It does have lots of architecture-specific assembly, because we are importing it from dav1d. And since we are using Rust, if you want to try it, it's quite easy: you just use cargo and you have it. If you are more traditional, you can enjoy it from GStreamer, FFmpeg, and pretty much any software that can consume our C or Rust API. So rav1e aims to be fast, featureful, and safe. How far along are we? Let's see.

Jean-Baptiste has already presented what AV1 is and who is involved, so I can skip all of that, because I know you were all here before. The summary is: for AV1 we are pretty well positioned on the decoding side because of dav1d. We have support in all the browsers, again mainly because of dav1d. That part is a sort of solved problem, besides 10-bit, but work is being done. And we have hardware. So all is good, right?

Well, encoding is a different story. Encoding is just hard. If we consider past history: for H.264, it took about seven years to get the great encoder that x264 is. x265 is another good project; to become a good competitor to even x264 it took about the same time, and in that case it managed to leverage a good deal of experience, because it shared a bit of code with the previous one. And HEVC is much harder, much more complex than H.264, right? So what happens with AV1, which is a lot more complex still, with lots more features?

Well, here is what we have. Open-source-wise we have libaom and SVT-AV1, and both come from a lot of previous code. One is the inheritor of libvpx, so most of its structure is from there. SVT is a whole family of encoders: there is an SVT encoder for whichever codec you can think of. Again, a long tradition. And they are putting a lot of effort into getting AV1 ready and producing something as amazing as AV1 should be, because AV1 in itself, at least on paper and partially in practice, is a really good codec.

So what happens? Well, libaom is damn slow. It's really slow. It is where all the experiments happened, and because of how it is managed, we could say it's some kind of graveyard: even the code that didn't make it into the specification sort of lives inside, lingering. SVT-AV1 is blazing fast. It's really fast. But it needs lots of hardware, a lot, and obviously there are trade-offs. So currently SVT-AV1 could be a good solution, at least if you have enough silicon to sacrifice to the SVT god; if you don't, well, sour grapes.

So what's the plan with rav1e? rav1e is completely new. It comes from a different kind of experience, because most of the Daala team is now working on rav1e. So it's from scratch, written in Rust, but we have some background, so to speak.
And we focus on something completely different. We want to explore; we want to leverage the experience from Daala. So the focus is on finding different solutions, trying different paths, using different algorithms, and trying to get the best perceived quality. Speed is a concern, memory footprint is a concern, but the main focus is to experiment more and see what we can do. Obviously it was initially quite fast because it was quite tiny, and we want to stay fast and get even faster.

So, first part: we want code that is readable. Not many, many, too many lines of code, and not something that was sort of smart to write once, where the future you is going to complain a lot about the past you for your choices. Speed is a concern, but we don't want to get speed just by using more hardware: we want something that is fast no matter what kind of hardware you have. Compact, meaning you can run multiple instances without requiring way too much memory. And we would like to make sure that real-time encoding can be a thing, that batch VOD encoding can be a thing, and that everything in between those two extremes is viable. So, a lot on our plate.

When I say that rav1e is lean, I mean that libaom is large, really large: the code is lots of C, lots of C++, plus some assembly. You can get lost simply because there are way too many lines of code to search and sift through. rav1e, considering all the optimizations, is nearly a fifth of it, and if we consider just the Rust code, so no assembly optimizations, it's about 55k lines of code, so fairly tiny. If we aggregate the two projects, dav1d and rav1e, so that we have something functionally similar to libaom, we are still half its size, even counting all the assembly we are using. So if you care about AV1 and want an idea of how it works, you can take the two projects together, just the C code on one side and the Rust code on the other, and it stays within about 100k lines of code. That's still a lot less to read to figure out what's going on, and both code bases are fairly easy to read compared to others.

So we want to be fast. How can you get faster? Let's say our first focus is better algorithms: even the theory behind them has to change before you can actually get something fast. But you can also just look at what you did and try to figure out whether you can avoid some work entirely. Another, much easier way to be faster is to leverage what the CPU provides: SIMD is available pretty much everywhere, and using it gets you good results without requiring as much intellectual effort as rethinking the whole algorithm you are going to use. Another important item is being careful about how you use memory: cache locality is going to kill you or save you, depending on how your code is laid out.

One online question: is parallel encoding a thing in rav1e, as in distributed across several machines? I will answer it at the end. It will be.

So, last but not least, multi-threaded processing: we throw more hardware at the problem, and since in many cases we do have multiple cores in our machines, it can be useful, depending on your use case.

What do I mean by algorithm improvements? Well, some of it is sort of easy. We have lots of processing that consists of applying some kernel over an image, and in many cases the intermediate results can be reused. The concept of integral images lets you lay out what you are doing so that the intermediates do not have to be recomputed over and over, and that sped up the loop restoration passes a lot.
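To make the integral-image trick concrete, here is a minimal, self-contained sketch. This is not rav1e's actual code, and the names (integral_image, rect_sum, plane) are just illustrative; it shows the textbook summed-area-table idea the talk refers to.

```rust
/// Build a summed-area (integral) image: integral[y][x] holds the sum of all
/// pixels above and to the left of (x, y), computed in a single pass.
fn integral_image(plane: &[Vec<u32>]) -> Vec<Vec<u64>> {
    let (h, w) = (plane.len(), plane[0].len());
    let mut integral = vec![vec![0u64; w + 1]; h + 1];
    for y in 0..h {
        for x in 0..w {
            integral[y + 1][x + 1] = plane[y][x] as u64
                + integral[y][x + 1]
                + integral[y + 1][x]
                - integral[y][x];
        }
    }
    integral
}

/// Sum of any rectangle [x0, x1) x [y0, y1) in four lookups instead of a loop,
/// which is what lets repeated kernel evaluations share their intermediates.
fn rect_sum(integral: &[Vec<u64>], x0: usize, y0: usize, x1: usize, y1: usize) -> u64 {
    integral[y1][x1] + integral[y0][x0] - integral[y0][x1] - integral[y1][x0]
}

fn main() {
    let plane = vec![vec![1u32; 8]; 8]; // a flat 8x8 test "image"
    let integral = integral_image(&plane);
    assert_eq!(rect_sum(&integral, 2, 2, 6, 6), 16); // a 4x4 block of ones
    println!("ok");
}
```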
Rate-distortion optimization: this is where we spend most of the time. So what can you do there? Well, this kind of code is like walking a tree: you make decisions and you decide where to go. If you prune it properly, because you know that going further down is not going to lead to anything useful, you are going to save a lot of time. So we did a lot of work to set up early-exit conditions, so that we are not doing work we are going to discard anyway. And this kind of work is something we do all the time.

SIMD: we love SIMD. Everybody loves SIMD. Not everybody wants to write SIMD code; well, a good number of people do. Anyway, we like it. So how do you do SIMD in a Rust code base? Two ways. One is using std::arch, which is part of the standard library; the intrinsics there are somewhat like the C ones, but arguably better performance-wise. On the other side, assembly is good: the people working mainly on dav1d love assembly, and we can share it and use it. And since we are using Rust, even the compiler is going to help us much more than with C, because the Rust language gives the compiler more information, and through that, the compiler can produce better auto-vectorized code. That is helping us a lot, even more if people want to use AVX2: you can just enable it, and the compiler will produce fairly good AVX2 code for your normal loops. So that part is good.

Multithreading and Rust are sort of a success story. When you write multithreaded code in other languages, you will end up making mistakes, and you will end up spending lots of time debugging them. In Rust, you cannot make those mistakes: if it compiles, it usually runs, unless you made logic mistakes, but in that case it's your fault. And what can we do with that? Well, another question. We'll take it at the end if you want. OK, so that one will wait.

So, I was saying, multithreading: we can do it, and Rust lets you do it much more easily. But Rust also gives you something that is sort of magic, because Rust abstractions really are zero-cost most of the time, and as I said, if we are using iterators, the compiler is going to vectorize them already. So what happens when you use something that takes your serial iterator and runs it in parallel? Well, you get parallelism for almost free. What does "almost free" mean? This is our main loop. It's a bit of a mouthful, but basically we work on tiles, and we encode each tile. Simple, right? OK, so this is the serial version: you get the list of tiles, each tile gets processed, and that's it. But the tiles are independent, so we want to do that in multiple threads. That's it, just a single line changed, and everything works in parallel. We don't have to think much. Well, we have to think a little: we have to make sure the data types we are using are thread safe, and we must not mutate shared state in the closure; by closure, I mean this thing. And that's it. That's how we get lots of multi-thread goodness with minimal effort.
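This is roughly the shape of that one-line change, shown here as a hedged sketch rather than rav1e's real loop: Tile and encode_tile are made-up stand-ins, and the parallel iterator comes from the rayon crate.

```rust
use rayon::prelude::*; // rayon provides the parallel iterators

// Hypothetical stand-ins for the real encoder types, just so the sketch compiles.
struct Tile {
    bits: Vec<u8>,
}

fn encode_tile(tile: &mut Tile) {
    // ... the actual per-tile encoding work would happen here ...
    tile.bits.push(0);
}

fn encode_tiles(tiles: &mut [Tile]) {
    // The serial version was: tiles.iter_mut().for_each(|tile| encode_tile(tile));
    // Swapping iter_mut() for par_iter_mut() is the whole change. The closure only
    // touches its own tile, so the borrow checker can prove there are no data races.
    tiles.par_iter_mut().for_each(|tile| encode_tile(tile));
}

fn main() {
    let mut tiles: Vec<Tile> = (0..8).map(|_| Tile { bits: Vec::new() }).collect();
    encode_tiles(&mut tiles);
    println!("encoded {} tiles", tiles.len());
}
```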
We are doing even a bit more work, because we are not so lazy: in future releases you will have an alternative API that is based on channels. So for people that are used to Go, or people that are writing Rust, you will have an API that is much more streamlined and much easier to use.

And this is how it looks now: our API has a send_frame call that is used to feed the encoder with frames, and a receive_packet call that pulls the encoded packets out of the encoder. Sort of simple; there is a small sketch of this loop a bit further down.

And this is the effect of rayon: all of this part runs in multiple threads, more or less in an optimal way. We still have work to do on getting this other part onto different threads, so we don't have this kind of large gap that is fully serial. But this is how easy it has been to improve our situation.

How are we doing all this? I said I would mention some of the tools we are using; I already did that in the morning, so I will compress it. Mainly, we try to keep all our code as good as possible. We try not to use too much memory, and we check whether the new tools we implement have a strong or a small impact on the overall speed. To show what I mean by measuring, this is what happened to the allocations. We were using way too much memory and doing way too many allocations, in my opinion: in 0.1, around 6k, quite a lot; the kernel has to work a bit. In 0.2 we managed to cut that in half, which also caused a speed increase. Two days ago I ran the numbers, and we got even further below that. And we do this kind of analysis more or less all the time. To give you a comparison, this is SVT-AV1; as you can see, it is allocating quite a bit of memory. I mean: 1 gigabyte for us versus 6 gigabytes for the same content. Now you see what I mean when I say you have to be resource-conscious.

Speed-wise, we keep improving, or at least we try to. What you see as our top speed is not something to write home about yet, because a bit more than three FPS is still not exactly great. But compared to the roughly one FPS we started from, well, we are doing well. We are improving, and we will keep improving.

That was speed; now some specific features. Since I said rav1e focuses on different algorithms, we did work on RDO biasing. Basically, we have our decision tree, and we try to nudge the decisions based on how the future behaves for each block. If something will stay the same in the future, we bias toward it, so the encoder will decide to keep the block even if, by the metrics you can apply to a single frame alone, it might not be considered that interesting.

Chroma-luma balance is something that goes against common sense in coding, because with YUV you always say that luma is more important than chroma. Well, it's not always so: once you start to quantize the two, you can reach a point where the chroma differences caused by quantization are perceived more than the luma differences. So you can try to strike a balance: out of your bit budget you spend a little more on chroma and get better perceptual results.

Last but not least, AV1 has a concept of per-frame quantizer deltas: in every frame, for each block, you can nudge your quantizer a little up or down. You can optimize a lot with that and get optimal results without using many bits to signal that kind of change.
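Here is the promised sketch of the send_frame / receive_packet loop. It is modeled on rav1e's current public Rust API; the exact names and signatures may have differed in the releases discussed here, so treat it as an illustration rather than version-accurate code.

```rust
use rav1e::prelude::*;

fn main() {
    // Default configuration; a real caller would set width, height, speed, etc.
    let cfg = Config::default();
    let mut ctx: Context<u8> = cfg.new_context().expect("invalid encoder config");

    // send_frame() feeds the encoder; one blank frame stands in for real video.
    let frame = ctx.new_frame();
    ctx.send_frame(frame).expect("the encoder refused the frame");
    ctx.flush(); // tell the encoder that no more input is coming

    // receive_packet() pulls encoded packets out until the encoder is drained.
    loop {
        match ctx.receive_packet() {
            Ok(packet) => println!("got a packet of {} bytes", packet.data.len()),
            Err(EncoderStatus::Encoded) => {} // progress made, no packet ready yet
            Err(EncoderStatus::LimitReached) => break, // everything has been emitted
            Err(err) => panic!("encode error: {:?}", err),
        }
    }
}
```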
Since we like to have pictures: RDO biasing. The tree here is always the same, but this part is going to change. So you are not going to spend a lot on this chair, even if it is something you can predict quite well in this picture, because it is going to be covered. The tree, on the other hand: there you want to spend a little more in the past, so you don't have to spend a lot in the future. And that's the concept, quite simple; the implementation is a bit gory. Block importance, again, is the same idea: if the future is better, we are going to spend more bits; if the future is grim, we are not. And this is how we visualize the whole thing.

How much time do we have? Minus two minutes. Oh, great. So trust me, everything is great. I'll keep it for next year, maybe.

So what to expect? We started with 0.1 in Tokyo in December, at VDD. We got 0.2 about a month ago, then 0.2.1, in which we managed to land different kinds of improvements with some trade-offs: overall we are about 1% better, with a little slowdown. And 0.3 will appear next week, possibly, with more work on multithreading, more SIMD code written, and work on code paths so the compiler vectorizes them for us and elides the bounds checks, so the safety from Rust is not going to slow us down. Plus about a sixth fewer memory allocations, so more compact. We are working on the RDO biasing so it works better, but that comes with a slowdown at the high speeds. We are implementing more tools: now we have fine directional intra prediction and the intra edge filter. And we are giving more features to the user: if you want to use switch frames and experiment with them, we have them. If you want to use rav1e to make still pictures, AV1 also has an image format besides the video format, and that part now works. And if you want to get crazy and put the encoder in the browser, there is a little bit of work that will appear soon that makes that quite easy to do.

Further in the future, the channel-based API should be complete by 0.4, for better thread usage and an easier usage model for you. We are going to do a lot of work on rate control, since this is one of the weakest points of most encoders; we are going to try to make it fast and overall useful, so that doing two-pass encoding is not a daunting task. And the API is going to be expanded. So, to answer the initial question: rav1e is going to support chunked encoding, and the chunks can be encoded on different nodes. After the whole process, you will have a way to aggregate everything, not just the packets you are producing but also the rate-control information, so you can run multiple passes across multiple nodes. This should happen in 0.4.

The other question from the network: how do you track subjective quality over time? What you can see is Are We Compressed Yet, in which we spend a lot of CPU time doing multiple encodings with multiple settings over a large corpus that gives you good coverage, and we get lots of quasi-objective results. For truly subjective quality we don't have anything: we don't have any group of volunteers that keeps watching the same movie many times to tell us which looks better and which doesn't. If somebody wants to volunteer for that, they are welcome.

I'm sorry, we'll have to stop there, because it's already 2:30. I'm finished. If you have any more questions, you can ask Luca. That's it.