Thank you. So, who am I? I'm Luca. I contribute to a lot of open source software; lately I'm focusing on AV1, and I'm a contributor to rav1e and dav1d. Today I will talk to you about rav1e and how we managed to optimize it.

So what is rav1e? rav1e is an AV1 encoder. It's written in Rust. It currently has a large amount of SIMD, either by importing what is written in dav1d as plain assembly, or written directly using the Rust intrinsics provided by std::arch. It has a good deal of multi-threaded code, and most of it leverages Rayon, because we are lazy. Today I will show you how we managed to get there, which tools we used, what worked well, and which crates make your life much easier.

So, why optimize? If your software is working, you're done, right? Well, no. You might want to put your software in something tiny, so you want to optimize for size. You might want to use your software for a certain purpose: in the case of a video encoder, or even a database, you want to optimize for latency — you ask for something, you want it back as soon as possible. You send a frame to the encoder, you want it back immediately so you can push it down the network to the other side: a normal videoconference scenario, or security scenarios; there are many. You might want to put your software on mobile, and then you have two different problems: one, you don't want to drain the battery, and the other, you don't want to burn the device, because sadly mobile CPUs can be really fast, but only for a short amount of time, and then it gets dangerous. So, many reasons to optimize. Another use case: you just want to be resource conscious. You don't want to pay that much money to run your business, so you optimize for throughput, which is a different target — it's a mix between optimizing for speed, CPU usage and other resources; mainly you are optimizing for money. And last but not least, you want to prove that you are the smartest guy, so you want to make the whole thing as fast as possible. That alone is not exactly a good reason, but if you have multiple competing projects it does make for some healthy competition.

So why are we optimizing rav1e? What do we want to do? Well, conquer the world. But we are talking about video encoding, and for video encoding you have different targets. Maybe you want the best quality — that's something sort of esoteric, because quality in video is not exactly objective, it's mostly subjective — so you throw all the time possible, all the memory possible, everything possible at getting the best-looking video. That's one target. Or you care about single-encode speed, some kind of track racing: I have a single video and I want it encoded immediately, or as fast as possible, and I don't care about the rest. Track racing is about as useful, in my opinion, but it's a good benchmark, so to speak. Lowest possible latency: great for security, great for videoconferencing. In that case, the amount of CPU you can throw at the problem is limited by the fact that latency requires you to get the frame out quickly; you can try to use multiple threads if you can split the frame, but you are restricting the tricks you can use. And then throughput, which is: I don't want to spend that much money to get results. This one is a bit harder than just getting the best quality or the best speed.
You have to consider multiple trade-offs, because you can have a batch of videos and a certain amount of money, and you want to get them through with a result that has the right quality and takes the right amount of time — but your target is money. What rav1e is trying to do is to land in the right place in between all of those, and as you may notice, quality doesn't care about the resources you are throwing at it, while money is pretty much the best proxy for all the possible resources. So you want to land in between: the right quality, not spending too much, and obviously not taking ages to get results, because otherwise it's pointless.

So what do I mean by optimization — how do we optimize? You need to pick your target, what you want to optimize for; if you have multiple targets, it's better to pick one, or two at most, at the same time. Then you iterate: you measure how far you are — how much time am I spending, how much time am I spending in each function, and so on and so forth, or how much memory am I using, or what is my bill for this month — and once you've done that you pick what you want to change, and you change it. Is the result good enough? OK, move on to the next target. Not happy yet? Measure again. Fairly simple, right? Who has done something like that? Please raise your hand. OK, nothing new.

So let me unpack it a little for the people who haven't experienced it. We start by picking a target and selecting the use case. When we are measuring, we split the task in two: we profile and we benchmark. The two things are a little different, even if they sound sort of synonymous. Then we do our evaluation, then we change the code, and the loop continues.

Let's talk about metrics. Many of them are completely objective. Single execution time is easy: you just use the wall clock, basically. Latency is easy as well: you just send a frame and measure when you get it back. Simple. Memory usage: OK, who doesn't know what the maximum resident set is? Who has never heard of it? OK. Basically, when you are running an application, your operating system gives the application a certain amount of memory. Some of it has to be present all the time; some of it can be swapped around, because the operating system can be smart about it. The maximum resident set is the amount of memory that your application absolutely needs, the part that cannot be swapped out. If you don't have enough, the system is going to kill your application once you actually touch that memory, if we are talking about Linux; other operating systems are more conservative — your application is not going to run, it gets killed as soon as you ask for the memory. So you care about that if you're running multiple applications or budgeting your system. Allocation count is another kind of memory metric. We love dynamic memory, we love malloc, we like everything that comes with it — and allocation count is something that can be problematic if you are allocating a few thousand times during your runtime; we want to reduce that. Throughput: number of results per unit of time, number of results per resource spent — how many videos do I get out for that amount of money? Throughput is a mix: it's harder to figure out when you are doing great, but still sort of easy to measure. Quality: completely application dependent. For video it's a bunch of magic — we could talk about VMAF, we could talk about PSNR, we could talk about lots of stuff — we are not going to talk about it here.
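By the way, to make the allocation-count metric concrete: you can get a rough number directly from Rust by wrapping the global allocator. This is just a sketch of the idea, not what rav1e does — the real tools I'll mention later also give you call stacks.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Wrap the system allocator and count every allocation.
struct CountingAlloc;

static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static A: CountingAlloc = CountingAlloc;

fn main() {
    let v: Vec<u32> = (0..1000).collect();
    drop(v);
    println!("allocations so far: {}", ALLOCATIONS.load(Ordering::Relaxed));
}
```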
So in our case it's a trade-off between quality and speed: you get better quality, you are going to sacrifice some speed. For memory, you could do horrible, horrible stuff and use a lot of memory to get somewhat better speed, but we tried not to do that, or at least not to get way too extreme about it. For rav1e we shifted the focus between speed, quality and throughput. 0.2, which was about a month ago — the focus was speed: we wanted to make everything faster. 0.3: quality — we want to make most of it better in some way. 0.4: we will probably try to balance everything and get better throughput all in all. So the idea is that we want to keep everything in balance. Trade-offs: we want to measure one thing and then the other, so we want good tools to do that. And also, since it's an iteration, you don't want to spend lots of time on the measurements themselves — that can be quite problematic when your encoding speed is something like 3 fps at best, right?

So first, we want a good use case: something that represents what our users are doing. How do we do that? Either we probe the users and get them to tell us or give us actual samples, or we do it a different way. Well, code coverage — it's not rocket science, right? Who knows about code coverage? Who uses code coverage? Everybody, good. Code coverage is good when you are testing, but it's also good when you are profiling and benchmarking, because the two are quite close — you're just using different tools most of the time. Coverage is not as important as when you are testing, though: 99% — you don't care; 50% is good. Why? Because you want to go first for the low-hanging fruit, and once you manage to fix that, you move further. So your use case has to be well representative, but you don't have to push it too far, because it's going to take time to profile it over and over and extract the right amount of information. 8K videos are cool — don't use them. 4K videos — depends on what you are doing. 1080p is well enough: if you have good coverage with that and the video is complex enough, you are fine. You can even go lower.

So how can you be sure? One way: you profile it. How do we do that in Rust? We have a number of ways. The first one is using the equivalent of -pg or -finstrument-functions or whatever it is in the other compilers: in Rust, -Z profile. And this is already a problem, because it's not stable yet. On the other hand, the output is completely stable and completely standard: all the tools you normally use for collecting profile information and processing it work fine. grcov is fairly good if you want to push the information further; gcovr is still good as well; lcov — all the tools you may want to use, you have a way to use them, mainly because grcov converts all the information into standard formats. So we are covered. What's the problem with that? Instrumenting the code takes time, because you have to rebuild, and the runtime is going to be painfully slow — but you get exact information. How to do better, speed-wise? You can use something closer to a sampling profiler: kcov is one of them. It's blazing fast, it works on a number of platforms, mainly Mac and Linux, and it has good integration with cargo, so you don't have to think much, you can just use it. Same thing for the other tools to extract the coverage information and push it forward: you convert it, and you send it to whatever website you like.
And then you can do all the analysis you want, or just get the numbers from the command line. It's about two or three times faster, to give you an idea. If you're really hardcore about Rust, there is a project called Tarpaulin that is pure Rust. It's quite restricted — it works only on Linux, only on x86 — but being pure Rust you don't have to carry lots of dependencies and lots of stuff. Sadly, it doesn't work for us, because we are using assembly and it doesn't grok that kind of thing.

So how do we do it? The story with the normal -Z profile workflow is not exactly straightforward, as you can see: a bit of machinery, then you run cargo, and then you extract the information — it goes like this. kcov is much, much, much nicer. This is calling kcov directly; you can use the cargo integration if you prefer, and it's a single command. But mainly, all it does is set up kcov with the right include path. It's that kind of simple.

So let's assume we are done with that: we have our use case, we want to profile it and then extract benchmarks. What do I mean by profiling? Which are the common profilers? perf, to mention one; DTrace — sort of hard to use, and slow. I guess everybody has had the experience of running perf or DTrace one way or another, right? Nobody? Everybody? Good. So there it is: we gather profiling information and we reason about it, so we can get something that is good enough to improve our code without spending too much time — I call those benchmarks. Consider the profiling stage your integration test and a benchmark your unit test: same problems. A unit test is not going to be representative of the behaviour of the whole application. With a benchmark you can improve that single function — but if you are not profiling the whole thing, you can have bad surprises. That tiny function you managed to make fast, and now the whole thing looks like this — OK, what happened? Well, the cache is a thing, and if the function doesn't fit in the cache it's going to be extra slow, or maybe it's pushing out some other function that you would like to have fast. So benchmarks are good, but they are not the full solution: every time you manage to get a decent gain, profile again. Or, if you are in a situation where writing the benchmark is not cost-effective for your time, because you cannot extract that code path without a lot of workarounds, try to squeeze the test case down so it's fast enough to profile directly.

So, what can we use? Before was theory, now is more practice. First, hyperfine. It's some kind of glorified time(1): a simple runtime measurement, but it does all the statistical analysis for you and all the repeated runs for you. And it's quite good because, if the amount of noise in your system is low, it can tell you: OK, I don't need to repeat this test lots of times to weed out the outliers, because there are no outliers. So this is the first thing: if you manage to get your testing box completely idle, hyperfine is going to tell you, hey, running it just once is fine. And why does that matter? Because if you have to do it the proper statistical way, 30 runs, and every run takes a few minutes — oh, one hour is gone and you didn't do much. So first, try to get into that situation. Second, you want to see stuff. The analysis is important, but it's also important to get the information in a form you can grok immediately. So my first suggestion is cargo-flamegraph, which under the hood is using perf or DTrace.
So all the systems are covered, more or less. But instead of giving you data that has to be processed somehow, it does the work for you. So if you like flame graphs — or if you can tolerate flame graphs, since some people love them and some people hate them — that's the tool for you. You can explore the situation: the graph is interactive, you can click on it, you can expand it. It's nice. What else can you use? If you are doing something on a more embedded system, we have not-perf, which is a pure Rust replacement for perf that tries to solve some problems perf has — mainly the fact that, to do the analysis, you normally have to do it on the same machine that is running the data collection, and if the machine is tiny, that's going to be a problem. They managed to solve it. Again, it produces flame graphs, so the data analysis already does a lot of the work for you. If you like flame graphs, good.

But perf, DTrace and not-perf are sampling profilers: they check the situation n times per second, so they might miss stuff, and it can also be quite slow. Another suggestion, probably less known: uftrace. It requires an instrumented build — another unstable flag, -Z instrument-mcount — but it's much faster, and the tool itself gives you pretty much everything you want and much more. They did a lot of work on making it fast and also easy to use. So again: flame graphs, you can have them; Chrome tracing, you can have it; any kind of data visualization, they have plenty. If you're on Linux, on the right CPUs, do use it — it gets you great results. You're on Mac? Well, cargo-instruments. Instruments is provided by Xcode; if you only like open source it's not great, but it's effective, and on Mac it works. I couldn't find any integration with VTune, which would be my suggestion if you are really on Windows — so, sorry for you: learn how to install DTrace and use cargo-flamegraph. I'm not using Windows.

Once we have the information, we have to do something with it. We process it, we figure out the worst offenders — which functions, how much time, how much memory, everything — and we extract the benchmarks. Something important — actually the only really important part is this slide: when you start using threads, most of your analysis has to be reconsidered. Normally you just want to know how much time is being spent in a function; once you have threads, if that function happens in parallel, or is called many times in parallel, optimizing it is going to have a quite reduced impact compared to optimizing something that is a bottleneck because it is fully serialized and blocking your code flow. My suggestion for figuring out what's going on and keeping an eye on it is to use lightweight probes — because then, again, you are optimizing the time you yourself are spending. Lightweight probes, on one hand, require you to put the probe in the code; on the other, you are only probing what you care about. HawkTracer, which was also presented this year, is one of the best, and one of the easiest to use from Rust, and it comes with really nice visualization. As you can see, it's really easy to use, and you can use the Chrome tracing feature to actually visualize what's going on. In the case of rav1e, you can notice that send_frame and receive_packet are taking all the time, and you can see that this part is neatly parallel.
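Just to make "lightweight probe" concrete: this is not the rust_hawktracer API, just a minimal hand-rolled scoped timer in the same spirit — encode_frame is a made-up name, and a real tool does this with far less overhead and with proper, thread-aware output.

```rust
use std::time::Instant;

// A tiny "probe": report how long a scope took, tagged with a label.
struct Probe {
    label: &'static str,
    start: Instant,
}

impl Probe {
    fn new(label: &'static str) -> Self {
        Probe { label, start: Instant::now() }
    }
}

impl Drop for Probe {
    fn drop(&mut self) {
        eprintln!("{}: {:?}", self.label, self.start.elapsed());
    }
}

fn encode_frame() {
    let _probe = Probe::new("encode_frame"); // probe only what you care about
    // ... actual work ...
}
```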
Coming back to that trace: if you are improving something in a code path that, in the serialized part, is that short, the impact is going to be tiny. If you are improving compute_block_importances, well, it's going to have quite an impact — if you have that kind of data volume, of course. So once you start with threads, you have to care about this, and I suggest you use lightweight probes.

Memory. What are we going to do with memory, which is so boring? First, hyperfine doesn't support measuring memory usage. I'm discussing it with the author, but he's not really convinced. So you can use the old, stupid GNU time. getrusage is available on most systems; if you are on the other one, Windows, there are some performance counters you can use — it's not one function, it's a couple — and you can use them from Rust, since the winapi crate supports them. Once you have the ballpark, you can dig down. Two tools to dig down: one is MALT and the other is memory-profiler. MALT is written in C++, supports a plethora of different techniques to collect memory usage information, and has one of the nicest web UIs to dig down into the functions that take the most memory and get you all the data. What's the problem with it? It requires a bit of work to actually build, and it takes a bit of time, because it's not as fast as memory-profiler. memory-profiler — pure Rust, yay, so much easier to get going — is Linux only, which could be a problem if you're not using Linux. Its web UI is as rich or even richer; in my opinion the two are pretty much on par, but I prefer the memory-profiler one when I'm just looking at results. It works quite well, but it's a little harder to use, for a reason I will show you. On Mac, cargo-instruments — what's wrong with Instruments? It gets you all the possible information. Special mention: heaptrack, if you are on Linux, or if you are using KDE even on non-Linux. heaptrack comes with a UI that is really good for this, and MALT and memory-profiler produce the same kind of data that heaptrack can process, so you can mix both worlds. heaptrack is not as fast as memory-profiler and MALT, at least in my experience.

So this is what you can use as a toolchain. And what can you actually do about memory? Not much, right? I'm using this much memory: I am going to get rid of buffers that are pointless, I'm going to size them properly, I'm going to not use huge lookup tables. That kind of simple. Avoiding allocations: well, you can try to be smarter. You can use a slab, or whatever you prefer to call the technique. Every allocation removed from a hot path means better speed and overall better memory usage, because every time you allocate there is a chance you are fragmenting memory. And in the case of rav1e, where we need aligned allocations, there are cases in which we could manage to run out of pages if the job is long enough. So we had to focus on that and fix it — avoid really silly tiny allocations in hot paths, because we did make that kind of mistake. Last but not least, memory leaks. Leaking memory is safe, or so Rust claims; it is possible even in safe Rust, quite unlikely if you are not doing crazy stuff — it never happened in rav1e, at least — but you have to be careful about it, and the tools I mentioned give you information about that as well. So how do you run it? memory-profiler doesn't come with a run script, so you have to write your own, and after that it's still quite simple.
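Since I mentioned getrusage: here is roughly what reading the maximum resident set from Rust looks like, as a sketch using the libc crate on a Unix-like system — note that ru_maxrss is in kilobytes on Linux and in bytes on macOS.

```rust
// Rough sketch: ask the kernel for the max resident set of this process.
fn max_rss() -> i64 {
    let mut usage: libc::rusage = unsafe { std::mem::zeroed() };
    let ret = unsafe { libc::getrusage(libc::RUSAGE_SELF, &mut usage) };
    assert_eq!(ret, 0, "getrusage failed");
    usage.ru_maxrss as i64 // kilobytes on Linux, bytes on macOS
}

fn main() {
    let _big = vec![0u8; 64 * 1024 * 1024];
    println!("max RSS so far: {}", max_rss());
}
```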
Back to memory-profiler: with the run script in place it's fairly straightforward. How does the output look? Well, without digging into the data, just seeing what's going on: nice graphs. You can check them, see what's happening, and this is sort of what you would expect — you are using more or less the same amount of memory over time.

Benchmarking — and I guess we are running out of time, so I'm going to rush. For benchmarking we don't have something great built into Rust. For testing we all know, and mostly love, what we have; benchmarking is in a worse situation. Ideally, once we get the whole custom test frameworks work done, so custom tests are first-class citizens, everything will be easier. Right now, not so nice, not so good. Criterion is the best tool you can use, and it's quite good for getting speed and throughput analysis out of the box, so do use it. Sadly, we don't have much for memory.

So, changing the code — is Rust going to help us here? Well, yes. The strategy I like to suggest is: first, maximize the impact. Get the top five offenders, pick the easiest, fix it, feel good, keep going, and everything is going to be all right. Try to be conservative with trade-offs: if you decide, oh, I can precompute everything in a lookup table, and the lookup table is one gigabyte or two, you have a problem — speed-wise you are great, but you cannot run the application many times at the same time. Last but not least, if you are working in a team, expect that your code, your smart optimization, is going to disappear at some point. It can happen; you can do it yourself, somebody else can do it. Do not feel bad: it does happen, it makes sense, it's not disrespectful. So if you find something that doesn't make sense any more, remove it — it's fine.

So what can we do? First, think harder: use fewer resources by using a better algorithm. Rust is not going to help you there. Second, do it better because you have better tools in your CPU: SIMD is your solution. You have to think a bit, but not as much as reasoning about complexity to get the best algorithm for your specific use case. Cache locality and SIMD have interesting interactions; Rust is not going to help you much with that either, but the compiler still does a good job. Last, use more resources: use more memory — don't do that; somebody did, not in rav1e — or use more threads, within reason. Multithreading and Rust are great together.

So what did we do? First, SIMD. rav1e and dav1d are sort of twin projects, so we share code, and we also share developers. How do you use assembly in Rust? We have good integration: nasm-rs and cc-rs cover everything for you. Quite simple, quite straightforward, you don't have to think much about it — but writing assembly is pain (or, if you are OCD enough, quite rewarding — but still pain). What can you do in plain Rust? We have good intrinsics, and we even have built-in CPU feature detection, so you don't have to reinvent the wheel for each project: the standard library does that for you, so you can use it. Also, the compiler can do the work for you: if the compiler knows that an extension is available on all your target CPUs, it's going to use it, and the compiler is quite good at auto-vectorizing code — iterators unroll and vectorize quite well by themselves. So if you write the code the right way, the compiler is going to work for you. On top of that, if you can enable AVX2 because all your CPUs are going to have it, you end up with a really fast binary, mostly for free, assuming you don't have to support legacy hardware.
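To make that concrete, here is the general shape of the pattern — a sketch, not rav1e code; sum_of_squares is a made-up function — combining runtime CPU detection from the standard library with a function compiled with AVX2 enabled, which the compiler is then free to auto-vectorize:

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_of_squares_avx2(data: &[i32]) -> i64 {
    // Plain iterator code: with AVX2 enabled for this function,
    // the auto-vectorizer can use the wider registers.
    data.iter().map(|&x| (x as i64) * (x as i64)).sum()
}

fn sum_of_squares(data: &[i32]) -> i64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe to call: we just checked the CPU supports AVX2.
            return unsafe { sum_of_squares_avx2(data) };
        }
    }
    // Portable fallback for older CPUs and other architectures.
    data.iter().map(|&x| (x as i64) * (x as i64)).sum()
}
```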
Multithreading. Multithreading in C: pain. Multithreading in C++: utter pain. In most languages, once you start using multithreading it means you are going to spend nights in the debugger, or trying to get Valgrind to figure out what's going on. In Rust, most of the pitfalls are prevented by the compiler, and the standard library provides most of the stuff you want. You want more? We have more: external crates. More primitives, better primitives, faster primitives. parking_lot: you're spending a single byte for a Mutex — not four, one. Good. Crossbeam: better channels; and if you want richer data structures, crossbeam serves you there too. And we are lazy — we are really lazy. We managed to learn about iterators, and Rayon gives you an automatic thread pool and a really easy way to move from a serial iterator to a parallel iterator, and you usually do that with a single line of code. I'm not kidding you: this is our main encoding loop. A little complex, but I mean, it's just iterators, right? OK — and this is the multi-threaded version. That's it. Simple, right? Fearless concurrency — well, lazy concurrency. Obviously, if you are using Rayon you might be suboptimal in some ways; the rav1e and Rayon people are currently exchanging notes on how to improve the whole situation, so future Rayon will make rav1e even faster. So, remember this? This is the encode part we saw before: before Rayon, that block was that long; now it's like this. Kind of simple — you do have to care about it. So we love Rayon. We also love crossbeam: the upcoming API change will bring a channel-based API, so if you like channels and you like encoding, you will be quite happy, because it will be extra straightforward.

And if we are talking about memory, what do we have? Most of the problems we had were there because using vectors is way too easy, and every time you are using a Vec you are using heap memory: you're allocating, you are poking the kernel in the wrong way, and it can be a problem. How do you avoid that and keep using vectors, since the API is so nice? We have solutions: arrayvec, smallvec, tinyvec — pick one of them; for the user they are more or less the same. arrayvec was the first; smallvec is specifically geared towards Servo, so it's some kind of hybrid; tinyvec is the smallest possible implementation of the concept. It boils down to: OK, I know the size beforehand, so I'm not going to need something that has to grow — I'm just going to use an array (or something similar to an array) as the backing storage, so I can keep everything on the stack. So: first, it's the cheapest to access; second, no allocation; third, if your workload is such that you are going to fill the whole capacity all the time, or nearly all the time, you are not increasing the resident set, statistically. So this is a good approach. If you want richer data structures that do the same thing, I can mention arraydeque, which applies the same concept to a deque, so it behaves a little better than a plain array; and there are a number of other data structures doing this — contain-rs has a whole collection of them. Tread with care, because that whole effort is sort of suspended: if you like those structures, do help them; if you just want to use them, be aware that they are maintained but not developed.
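As a sketch of the pattern those crates give you — using a recent arrayvec here, with a made-up capacity and a made-up function name:

```rust
use arrayvec::ArrayVec;

// The upper bound (64) is known in advance, so the "vector" lives entirely
// on the stack and never touches the allocator.
fn nonzero_coeffs(coeffs: &[i16]) -> ArrayVec<i16, 64> {
    let mut out = ArrayVec::new();
    for &c in coeffs.iter().take(64) {
        if c != 0 {
            out.push(c); // no heap allocation; panics only if over capacity
        }
    }
    out
}
```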
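And since I claimed earlier that going parallel with Rayon is usually a one-line change, this is the general shape of that change — not the actual rav1e loop, just the idea, with made-up names:

```rust
use rayon::prelude::*;

fn cost_of(block: &[u8]) -> u64 {
    block.iter().map(|&p| p as u64).sum()
}

fn total_cost(blocks: &[Vec<u8>]) -> u64 {
    // Serial version:
    //     blocks.iter().map(|b| cost_of(b)).sum()
    // Parallel version: the only change is iter() -> par_iter(),
    // and Rayon schedules the work on its thread pool.
    blocks.par_iter().map(|b| cost_of(b)).sum()
}
```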
Another solution for you: why do you need the intermediate buffer at all? You can use iterators. Iterators are good, and iterators use less memory. Sometimes you have to turn your code around a bit, but the result is less memory, usually faster, and usually it even ends up using SIMD. So if you want to spare memory, iterators can help you as well.

So, to show what happens when you start using these tools: the first rav1e was doing this kind of number of allocations, about 6K. Arriving at more or less the pre-release, we went down to there. Nowadays, even less. So you can use this kind of tool, and you can actually see the impact this way. And I guess that's all. I don't know how much time we have for questions.

OK, who's the lucky guy? OK, you. The question is where you can find the slides, online or anywhere. You can find them on the FOSDEM website, and you will have everything there. So: you asked about the slides, where to find them, and the answer is the FOSDEM website. Yay, you.

Next question: you said we just spend a byte for a mutex, but don't you ever fear that, if something else is in the same cache line, you are going to have to bounce that cache line between cores? Because if you have two threads, one on one core and the other on another core, and one takes the mutex, and the mutex happens to be here, then the cache line has to move there, and all the data in the same cache line moves with it; so if the other thread then needs the same data, the cache line has to move back. OK, so the question is about how parking_lot is able to make the mutex that tiny. And my answer is: it's written in the code and it's also documented; it would take way too much time to answer properly, and I'm probably not the right person to do it. Does somebody have a last question, or are we done? I think we're going to have to take it outside. OK.