Can everyone hear me? Okay, yep, cool. So yeah, as I mentioned, I'm Paul, talking about benchmarking and optimization of Rust libraries. Just as a little bit of an intro about who I am: I'm currently a principal engineer at Human Interest, which is a 401(k) processing company in San Francisco. You might be able to tell I'm not from there, so I've got an accent. I've been working in Rust for a number of years, and as a result I've ended up creating a number of libraries. One such library that I'll be using in a lot of these examples is the Rust Decimal library, which is a fixed-precision decimal number library written in Rust.

I've been told my accent's hard to understand, so I thought I'd start off by giving a little bit of a 101 on the New Zealand accent. Essentially, if you take the vowels, take out the O, and then move the sounds of the vowels over by one, that's the New Zealand accent. So if I say the word pan, you might hear this. If I say pen, you may be hearing that. And likewise, if I say pin, sometimes you might be hearing that too. If you're hearing the word in yellow, then you're gonna be good. But if you're hearing the word on the right, you may need to reverse engineer what I'm saying some of the time.

So, onto the talk. The Rust ecosystem's growing. When I first put together these slides, it was just under 17,000 crates. This morning it was around 17,700, and very soon it's gonna be 18,000 crates. With all those crates, we're starting to use them in our projects, and we're implicitly relying on them and their downstream dependencies to do the job correctly, securely, and quickly. As a case in point: if you're using a decimal number library, you don't want it to be the performance bottleneck of your system, and you also don't want it to be the source of bugs.

To really understand the performance of a system, the first thing we need to do is be able to measure it. The idea is that you can then understand whether you're making positive or negative impacts on your library. There are a couple of terms you've probably heard around this: micro and macro benchmarking. Micro benchmarking is where you're measuring a very small unit of performance, and macro benchmarking is where you're trying to simulate customer application workloads. When we're looking at some of these libraries, we may need to consider both, and we'll go into some of that detail throughout this talk.

To begin with, in order to benchmark, we need to know some of the tools that are out there. Cargo bench is the most obvious one to start with, because it's included with cargo. It leverages the test crate, and the test crate's internals are unstable at the moment, which means it can only be run on nightly. But getting it up and running is relatively straightforward. Essentially we need to enable the test feature, and from there it exposes the bench attribute. The bench attribute indicates which functions should be called when the benchmark runs. Within each of those functions, you'll see an iter call: everything inside there is what's being benchmarked. In this first example, it's the sub function being benchmarked, and the result is being passed through to the black_box function immediately afterwards.
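As a rough sketch of what that setup can look like — this is my own minimal example rather than the exact slide code, and the sub function here is just a stand-in for whatever you're measuring — a nightly-only bench file might be:

    #![feature(test)]
    extern crate test;

    use test::{black_box, Bencher};

    // Stand-in for the function under test, e.g. a decimal subtraction.
    fn sub(a: i64, b: i64) -> i64 {
        a - b
    }

    #[bench]
    fn bench_sub(b: &mut Bencher) {
        // Everything inside iter() is what actually gets measured.
        b.iter(|| black_box(sub(black_box(100), black_box(42))));
    }

You'd run this with cargo +nightly bench.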
The reason the result goes through black_box is to make sure the compiler isn't doing any optimizations that would mean we're not actually testing what we think we're testing inside that iter closure. Running these benchmarks gives a pretty straightforward tabular output: the names of the tests on the left, the number of nanoseconds per iteration on the right, and on the far right the variance between the fastest and slowest runs.

Over time you'll probably want to compare these results to make sure things are improving rather than regressing. There are various ways of doing this: you could use spreadsheets, you could use all sorts of archiving strategies. But one tool that makes things a little easier is cargo benchcmp. That's a cargo plugin: if you pipe your bench results out to files, you can pass those files to benchcmp and it will spit out a table like so. The first thing you'll probably notice in this particular table is the red and the green. Red indicates that something bad happened, a regression, and green indicates that something good happened: either the same performance or an improvement.

So, cargo bench is included with cargo, so it's pretty low effort to get set up. It's very fast to compile and execute, and you don't need any external crates in your Cargo.toml. With a good archiving strategy, you could also compare results from version three back to version one if you wish to. The major downside is that it's nightly only, and the catch there is that performance on nightly may be different from stable: you may be testing against experimental compiler features or whatever else, which means your results may differ a little. Cargo benchcmp can also be a little sensitive to thresholds. In the previous example there was green and red, but I was actually running it against the same code without making any changes. You can control the threshold, but you've got to make sure you're comparing like for like: some of those tests were taking two nanoseconds, and if one goes to three nanoseconds, that's a 50% increase. The other thing with cargo bench is that looping through sets of values can be a little tedious. With a decimal number library, you want to test a range of values: seven divided by three is not the same as six divided by three; a very different calculation happens under the hood. You can do this with cargo bench; it's just a matter of leveraging macros to generate the boilerplate code for you.

A second library out there which is also viable to use is Criterion.rs, which is inspired by Haskell's criterion library. It's been written so that it runs on stable by default, which means it also runs on beta and nightly. Getting it up and running is pretty straightforward as well: it's a matter of including the criterion crate and then overriding the bench harness so that instead of invoking the default one, it invokes the Criterion bench harness. Within the code it's relatively straightforward too: we've got the criterion_main macro, which is essentially what gets invoked as the harness. And where cargo bench had the bench attribute on each of the functions, Criterion has criterion_group.
You list the benchmark functions in that criterion_group macro instead. The actual function that gets called looks similar in that it's got an iter call and passes the result to black_box. However, the big difference is that it's wrapped in a bench_function call, and that's there to name the benchmark: because you're not able to leverage the bench attribute, you've got to name each benchmark with a string for it to work on stable. One nice side effect of this is that there's also bench_function_over_inputs, so you can pass multiple inputs to a single benchmarking function.

The results look quite different. As you can see, it's not a tabular format anymore, but for each benchmark it gives you a lot more information: the time, the percentage change, outliers, statistical significance, whether it regressed or not, et cetera. In addition, it spits out these graphs. I haven't actually used the graphs much in practice; however, I can imagine they'd be a great thing to show your boss to demonstrate that you're actually making an improvement.

So Criterion is relatively easy to set up, and what I mean by that is it's an easy migration from cargo bench — the code is fairly similar. And it has some trade-offs which turn out to be kind of nice in some ways. bench_function_over_inputs means you can pass multiple inputs to a single benchmarking function, and it provides statistics out of the box, which means you don't need a second plugin to manage that. The major downside is that it's quite a bit slower than cargo bench: it takes quite a while to warm up the tests and then run them. It really depends on your use case, but if you want to run things quickly and make small changes, this can be a bit of a bottleneck. It also suffers similar threshold issues between iterations — there's a discussion going on in one of the forums about introducing the ability to control the threshold — and the previous example again showed regressions when nothing had changed in the code. The last thing, which is more of a gotcha, is that because you're naming your own benchmarking functions, you've got to be careful not to give two of them the same name. If you do, you get these weird, subtle comparison errors that can really throw you off if you haven't realized what you've done.

Before jumping into some of the practical performance work, the other thing I want to mention is understanding your application before running it under a benchmark. You can use instrumentation to help with this. The reason you might want to do it is to understand what the internals of the application are doing, so that you can make sure that what you're testing against is representative of a normal application run. There are various tools for this, but the whole idea is that you understand both the expensive calls and the high-invocation calls. As an example, in the decimal library, division on bigger numbers might call an internal subtraction function maybe 1,000 times, and that function isn't exposed by the public API. So it's important that that particular function is tested under those sorts of loads.
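To pull those Criterion pieces together before moving on — again a minimal sketch of my own with a stand-in function, not the library's actual benchmark — a Criterion bench file looks roughly like this:

    use criterion::{black_box, criterion_group, criterion_main, Criterion};

    // Stand-in for the function under test.
    fn sub(a: i64, b: i64) -> i64 {
        a - b
    }

    fn bench_sub(c: &mut Criterion) {
        // Benchmarks are named with a string instead of a #[bench] attribute.
        c.bench_function("sub two numbers", |b| {
            b.iter(|| black_box(sub(black_box(100), black_box(42))))
        });
    }

    criterion_group!(benches, bench_sub);
    criterion_main!(benches);

On the Cargo.toml side it's criterion as a dev-dependency plus a [[bench]] entry with harness = false, which is the "overriding the default harness" step mentioned above.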
So we'll go into some of the practical things, and this first one is going to seem like a bit of a cop-out because it's going back to fundamentals. But essentially, some of the biggest wins you can get come from re-examining your approach. There are things such as early exit conditions; operational efficiencies, like using a bitwise shift instead of dividing by a power of two; parallel operations; dynamic programming; and use of efficient types. We'll jump into the use of efficient types a little more later in this talk.

But just as an example, let's talk about Postgres write performance. Postgres has a numeric data type, which you're probably all aware of, and to write it out to the Postgres protocol you essentially need to break a number into groups of 10,000, then write out the number of groups you've broken it into, followed by the weight of the integer portion, the sign and scale, and then the groups of 10,000 themselves. So if we were to write out the number 3.14, we need to break it into groups: the integer portion and the decimal portion. The integer portion we pad to the left with zeros to make it into a 10,000s group, and the decimal portion we pad to the right to make it into a 10,000s group. When we write it out, we write out two groups in this case: the first is the integer portion, with a weight of zero, and the scale is two because there are only two decimal digits that are relevant in the 1,400 group.

That was relatively easy to explain from a base-10 perspective, and if we had the number as a string, it would be quite easy to reason about how to implement it: we essentially chunk it into groups of four digits, pad with zeros, and output the result. If we were to do that in Rust, we might do something like so: convert the number to a string, find where to split it so we've got an integer portion and a fractional portion, then chunk each by fours, padding to the left for the integer portion and to the right for the decimal portion, and write it out to the protocol in big-endian format at the bottom. If we do it this way with the 15 samples I'm testing with, it takes roughly 15,000 nanoseconds per iteration, so around 1,000 nanoseconds per sample for this particular example.

And of course I wouldn't be talking about this if you couldn't do better. What if you didn't have to convert it to a string at all? Well, you don't. If you take a step back and think about a pure math approach, you can scale the number up to the closest 10,000s boundary, then keep dividing by 10,000 while storing the remainders, and keep going until you reach zero and you've got your groups. Taking the 3.14 example again: we scale it up to the 10,000s boundary, which gives 31,400. We divide by 10,000 and get three with a remainder of 1,400. We divide three by 10,000 and get zero with a remainder of three, and there are our groups on the right — without having to convert to a string first. Implementing this in Rust, we essentially figure out what we should scale the number up by, to reach the closest 10,000s boundary.
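A rough sketch of that divide-by-10,000 idea — simplified here to a u64 mantissa and base-10 scale rather than the library's 96-bit representation, so this isn't the exact slide code — might look like:

    // Break a decimal (mantissa, scale) into base-10,000 groups, most
    // significant group first, without going through a string.
    fn to_base_10000_groups(mut mantissa: u64, scale: u32) -> Vec<u16> {
        // Pad the scale up to the nearest multiple of four so the fractional
        // part fills whole groups: 3.14 (mantissa 314, scale 2) becomes 31,400.
        for _ in 0..(4 - scale % 4) % 4 {
            mantissa *= 10;
        }

        // Keep dividing by 10,000, storing each remainder as a group.
        let mut groups = Vec::new();
        while mantissa > 0 {
            groups.push((mantissa % 10_000) as u16);
            mantissa /= 10_000;
        }
        groups.reverse();
        groups
    }

For 3.14 that yields the groups [3, 1400], which then get written out along with the group count, weight, sign, and scale as described above.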
In the version on the slide, I'm also using a couple of other operational efficiencies: instead of using the remainder and division operators, I just mask off the last two bits of the number, and likewise I bit-shift right instead of dividing by four. So we scale the number up, then divide it by 10,000, storing each group into an array, and finally we write it out to Postgres. The performance of that is quite a bit faster: 5,000 nanoseconds versus 15,000 nanoseconds, so we're getting a three-times improvement just from re-examining the approach. The first version was easy to reason about, but with the second one, if you take a math approach to it, you can get some clear wins just from doing that.

The next thing I want to talk about is fixed-size slices performing better. If we have two otherwise identical functions, where one of them takes a fixed-size slice of three and the second one takes an unsized slice, the hypothesis is that the first one runs faster. You're probably looking at this and thinking, well, obviously it runs faster, Paul: the second one does all these bounds checks which the first one doesn't. And you'd be right in this particular case. One of them takes two nanoseconds per iteration where the one with bounds checks takes three nanoseconds, and within a loop we get 2,100 nanoseconds versus 3,700 nanoseconds, so the gap starts to broaden a little. But what if we didn't have those bounds checks? What if we had this code here, which is of course a little dangerous to have in your production codebase, but nonetheless: how does it perform compared to a fixed-size slice? Well, in the single-iteration case it's close, two nanoseconds per iteration, but as we get into a loop we start to see a bit of a difference: 2,100 nanoseconds again versus 2,700 nanoseconds. What this is telling us is that by giving the compiler a fixed-size slice, it's able to make assumptions that let it apply further optimizations it can't make without that fixed size. So fixed-size slices are indeed faster.

Following on from that, inlining can both improve and hinder. With that previous example, say we just throw #[inline(always)] on top: what's going to happen? Well, if you said that things are going to improve substantially, you'd be right. But interestingly, both versions now behave exactly the same way. With inlining, both of them take 1,300 nanoseconds per iteration for the looped version and one nanosecond for the single-iteration version. What that's really telling us is that the compiler is able to draw conclusions by having the code inline, so it can optimize the compiled code and everything runs faster. And for the second example with the bounds checks, once it's inlined the compiler knows that the size of the array I'm passing through is three, so it can basically remove all those bounds checks and make things faster as well. With inlining, the common question I often hear is: are you smarter than the compiler, do you really know when to inline? And essentially, I think if you're measuring things and you understand the impact of what you're doing, then sometimes, yes, you can be smarter than the compiler.
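The shape of that comparison is roughly the following — my own reconstruction with a made-up sum operation, not the slide code itself:

    // Fixed-size slice: the length is known at compile time, so no bounds checks.
    fn sum_fixed(values: &[u32; 3]) -> u32 {
        values[0] + values[1] + values[2]
    }

    // Unsized slice: each index gets a bounds check unless the compiler can prove it safe.
    fn sum_slice(values: &[u32]) -> u32 {
        values[0] + values[1] + values[2]
    }

    // The "dangerous" variant: skips the bounds checks entirely, and is undefined
    // behaviour if the slice is shorter than three elements.
    fn sum_unchecked(values: &[u32]) -> u32 {
        unsafe { *values.get_unchecked(0) + *values.get_unchecked(1) + *values.get_unchecked(2) }
    }

With #[inline(always)] on these, the compiler can see the concrete three-element array at each call site, which is why the checked and unchecked versions end up with the same numbers once inlined.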
With a library, though, the main thing to call out is to be wary of the impact you're going to have on your downstream consumers, all the applications using your crate. Inlining will increase the binary size, and sometimes you won't get the benefits you want in exchange for that increase. So my general rule is to be very wary of inlining public API functions, but go nuts with your internal stuff if you know what the impact is.

The fourth one is that primitive types are almost always better, and the first example is a little contrived, but bear with me. If I'm representing a decimal number, I might represent it as m divided by 10 to the power of e. So if I wanted to represent pi to more precision, I'd have a bigger integer portion and a bigger scale to represent that. An obvious way of representing this number is with a big integer: it's just a single number that can store the m portion, plus a scale. Or we could store it as an array of four u32s: three of them forming a 96-bit integer, so we have a cap there, and a fourth one to represent the scale and sign. Or, since Rust 1.26, we could also use a 128-bit integer, which is pretty cool. We'll take a look at BigInt versus the array of four u32s, and as an example operation, we'll try to add together two decimal numbers.

Adding two decimal numbers is relatively straightforward: you need to have the numbers at the same scale to be able to add them. If I have 2.5, which is 25 at scale one, plus 3 at scale zero, then I just need to scale the 3 up to scale one to add them together. So 25 plus 30, that's easy: you get 55 at scale one, so 5.5. BigInt is a lot easier to reason about here, because it's one single number you need to scale up. If you have an array of three u32s, you need to start thinking about words and how the carry overflows between them; it's not a simple addition anymore, you essentially need to add the words together one by one. And yet, with a naive implementation of both of these, you can see that the array version actually performs a lot faster than the BigInt version. However, that's not really what I want to dig into in this part; I really want to talk about exploiting fixed-size primitives. By knowing that that array is 96 bits, we can make a lot of assumptions about how we add, or do other operations with, that particular number.

So shifting from addition to division: a naive implementation of division might look like so. I'm using 32-bit numbers here, but the concept is the same. The naive, repeated-subtraction version of division, if you're not aware of it, is essentially: you take the dividend and keep subtracting the divisor, and every time you succeed, you increase the quotient. As you can imagine, that doesn't perform very well. For smaller numbers you can get really good results, 80 nanoseconds, but as the numbers get bigger, 90 million nanoseconds per iteration is not acceptable — in most cases, probably. But we know that the size of the type is 32 bits, so with that we can also make various assumptions about how the division works. Instead of looping through n times depending on the size of the number, we can limit the loop to only 32 iterations and basically exploit the fact that it's a binary number.
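For reference, the repeated-subtraction version is essentially this (a sketch using plain u32s to match the 32-bit example):

    // Naive division: subtract the divisor until we can't any more.
    // The loop runs once per unit of the quotient, which is why performance
    // blows out as the dividend gets bigger relative to the divisor.
    fn divide_naive(mut dividend: u32, divisor: u32) -> (u32, u32) {
        assert!(divisor != 0);
        let mut quotient = 0;
        while dividend >= divisor {
            dividend -= divisor;
            quotient += 1;
        }
        (quotient, dividend) // (quotient, remainder)
    }

The fixed-width trick replaces that open-ended loop with exactly 32 passes of binary long division, pulling the dividend in one bit at a time.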
So we can exploit the fact that bits that carry over disappear from the number when we bit-shift left, and we can carry them across with the various tests that we need. I won't go into all the math here, but the idea is that we're now only looping 32 times, and we get around 42 to 46 nanoseconds per iteration for this division, for all of those particular inputs. I do want to call out that the one-hundred case is actually slower now: we're at 46 nanoseconds per iteration versus the 18 we had before. But it's a reasonable trade-off considering we now have very consistent performance across the board. The other thing I want to call out is that this is another example of why you should be testing across a range of inputs. If I only had one test dividing by a hundred, or even 200 or 1,000, I'd never see the incredibly bad performance; it's only as I get to the bigger numbers that it starts to get horrible.

The fifth thing I want to talk about is copy and borrow semantics. When we first get into Rust — well, I'm sure a lot of us have battled with the borrow checker before, and we've done things like this to get around it. That's obviously a warning sign, because cloning to get around the borrow checker isn't the way forward; it's the way forward temporarily, at best. But essentially, if you're avoiding the borrow checker, that's probably a bad sign. The reason being: copying or cloning a large struct, we all know that can be expensive, but it can also be expensive for smaller data if it doesn't fit neatly in registers. We need to consider the size of the data we're actually copying in these cases. As an example, we'll take the one we had before, the fixed-size slice. We've got the fixed-size slice of three at the top, and in the bottom one, the copy version, we don't have the borrow: we copy the array instead of borrowing it. If we look at the performance of this, for a single iteration it looks fine, two nanoseconds each, but once we get into bigger loops we start to see a bit of variance. To be honest, this particular benchmark is an interesting one because it varies substantially between runs — it really depends on what the process is doing at the time — but roughly it's around 2,400 nanoseconds versus 2,900 nanoseconds. So copying even at this size, just 96 bits, is slower than borrowing those 96 bits in this particular case. Again, if it fits in registers it's fast, but if it doesn't, it can potentially be a bit slower.

So what about unsafe? I mean, unsafe is a great way to improve performance, right? As library authors — well, the definition of unsafe within Rust, from my understanding, is that you're basically telling the compiler: hey, I know better than you, I know what I'm doing, trust me. And while that's fine when I'm writing the code — I trust myself, though of course I cause a lot of bugs as well — the community doesn't necessarily trust me the way I trust myself. So having unsafe within your code is something you should avoid if you possibly can. I have done a little bit of experimentation within the library, in the cases where I've thought, hey, if I just had a memmove here, that would be a lot faster.
And I've actually found that there hasn't been any noticeable difference. I think that's kind of interesting, and I think it's something that applies to everyone here: if you're measuring your performance and you see that there really is a huge improvement, then you can make that trade-off, but only if it's absolutely necessary. Originally I was jumping to the conclusion that unsafe was always going to be better in terms of performance, and it wasn't necessarily the case. So it's probably worth challenging yourself by actually measuring your results.

Just as some closing thoughts: as library authors, we have an implicit responsibility to consider performance. Small details do matter, and it's really important that you understand how your application works under load, or under normal circumstances, in order to benchmark it effectively. It's important that you're testing across a range of inputs — it would be a shame if you were only testing dividing by 100 while dividing by a billion was turning into 90 million nanoseconds — so make sure you're considering a whole variety of benchmark tests. And the main purpose of measuring is so that you have the insights to make effective decisions going forward. All of this has really been the result of experimentation: understanding what's there, knowing what you want to improve, and then actually testing a hypothesis and seeing whether it worked.

All in all, this community is awesome. You're great for coming out to RustConf, and we can keep making it awesome by really working together. So thank you for listening. If you have any questions, you can come and see me in the break. I also wanted to say a special thanks to all the people who helped me build these libraries — you're awesome, so thank you.