So today we're going to talk about understanding the performance of Rust programs and libraries. We're going to talk about some Rust-specific pitfalls, and I'm going to talk about some tools you can use to see the performance of your Rust programs — some of them you can use today, on your laptop. I'm an engineer. I work at Buoyant (and thanks to our design team for some great transitions). We make Linkerd. Linkerd is a cloud-native service proxy; we build what's called a service mesh. Here at Buoyant I work on some Rust and Scala code. One of the things that we've built is linkerd-tcp, which is a TCP proxy. It's written in Rust, and it's designed to work in cloud-native environments like Kubernetes or DC/OS. We have more protocols coming: right now it's TCP, but we're adding HTTP too. Just today we announced that we've open-sourced the h2 repo on GitHub — you can take a look at that. linkerd-tcp, just like Linkerd, is Apache-licensed and free for everyone. One of the things that we built to enable linkerd-tcp is something called Tacho, which is a stats library. It allows you to instrument your applications, and as a stats library it has some basic features: counters, for how many times something happens; timings, for how long something takes to occur, or how long some bit of code takes to run; and gauges, which are values at a point in time. This is also Apache-licensed. I'm going to use two of Tacho's macro-benchmarks as a driver for the topics that I'm here to discuss today. Here is a macro-benchmark that I wrote — I'll talk about what macro-benchmarks are in a second — but I wanted to give you a sense of what Tacho is before we start digging in and looking at graphs and performance charts and that sort of thing. So here is a multi-threaded macro-benchmark where, as you can see, we have a timing.
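As a rough, std-only sketch of what that driver does — the real multi_thread.rs uses Tacho's timing and counter types, so the names and structure here are stand-ins, not Tacho's API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Instant;

// One worker thread: time each iteration's work, bump a shared counter,
// and signal completion over a channel when done.
fn run_worker(iters: usize, counter: Arc<AtomicUsize>, done: mpsc::Sender<()>) {
    for _ in 0..iters {
        let start = Instant::now();
        counter.fetch_add(1, Ordering::Relaxed); // the "work" being measured
        let _elapsed = start.elapsed(); // a real driver records this timing
    }
    done.send(()).expect("main thread hung up");
}

fn main() {
    let counter = Arc::new(AtomicUsize::new(0));
    let (tx, rx) = mpsc::channel();
    let threads: usize = 4;
    let iters = 10_000; // the talk's driver loops 10 million times
    for _ in 0..threads {
        let (c, t) = (counter.clone(), tx.clone());
        thread::spawn(move || run_worker(iters, c, t));
    }
    // Wait for every worker's "done" signal.
    for _ in 0..threads {
        rx.recv().expect("a worker died");
    }
    assert_eq!(counter.load(Ordering::Relaxed), threads * iters);
}
```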
We measure how long it takes to do an action, in a loop of about 10 million iterations, inside of a single program. This macro-benchmark driver is called multi_thread.rs. What do we do here? We set the current iteration, we increment a counter — this is the basic work we do — and we do this 10 million times; at the end, we send a signal on a channel to say we're done. We'll come back to this in just a second.

So here's the outline of the talk. First, we're going to talk about some causes of slowness in your programs, then some Rust-specific pitfalls, then tools, and then we're going to talk about a great measurement called IPC. In native programs, there are generally three causes of performance problems: memory stalls (which is talking to DRAM, essentially), lock contention, and CPU utilization. When we talk about memory stalls, what we're really talking about is the memory hierarchy. These aren't exact numbers; they're here to give you a sense of the order-of-magnitude changes as you move up levels when you're talking to memory. A register takes about half a nanosecond to talk to. Your last-level cache is going to be something like 10 nanoseconds — remember, in a CPU you've got an L1 cache, an L2 cache, and sometimes an L3 cache, and each level of the cache is larger and slower than the one above it. And then talking to DRAM is going to be about 100 nanoseconds. So that's just a quick overview of the memory hierarchy to remind you. Then we have lock contention: things like spin loops and blocking waits. These are things that can cause performance problems in your program, and we'll see one of these in a moment. And then CPU utilization. The reason I want to talk about CPU utilization last is that it generally hides the other problems.
CPU utilization can hide the other two things that I talked about. It can hide memory latency behind a slow instruction: looking at a given instruction, you can't tell whether the instruction itself is slow or whether it's waiting on memory — it looks the same. And a function can hide lock contention, like a spin loop. Idleness is often counted as useful work: you look at top or htop and see "oh, I'm using 90% of my CPU," but that can also mean you're spending 80% of your time waiting on RAM or disk.

Now I want to talk a little bit about some Rust-specific pitfalls. Here's a big one: deriving Copy on large structs. Copy is great — it makes your programs much more ergonomic, and it can be a real lifesaver — but if you find yourself overusing it, you can accidentally kill your DRAM bandwidth without meaning to. The great thing about Clone is that it's explicit; Copy is implicit. And the most common reason is "it was small when I started." Here's an example: someone makes a Person struct that just has an age as an int and a reference to a string, and then somebody decides, well, let's put the user's whole DNA in there — and that's 800 megabytes. It should be a reference, or maybe not there at all. And since we're talking about Copy, we can talk about Clone. Calling clone in a loop is something that can kill performance by saturating your DRAM bandwidth. The nice thing about Clone is that it's explicit: you'll see it in a profile, right there at the very top. Here's just a sort of contrived example of going through a people vector and pushing them into another friends vector.
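A sketch of the pitfall, with the sizes shrunk way down — the struct and field names here are illustrative, not the ones on the slide:

```rust
// The "it was small when I started" pitfall: with Copy derived, every
// assignment or by-value pass silently memcpys the whole struct.
// (The slide's version grew to hold 800 MB of DNA.)
#[derive(Clone, Copy)]
struct Person {
    age: u32,
    dna: [u8; 4096], // should be a reference or a Box -- or not here at all
}

// Borrowing the slice avoids copying any Person at all; iterating with
// `.copied()` or cloning inside the loop would move 4 KB per element.
fn oldest(people: &[Person]) -> u32 {
    people.iter().map(|p| p.age).max().unwrap_or(0)
}

fn main() {
    let people = vec![Person { age: 40, dna: [0; 4096] }; 8];
    assert_eq!(oldest(&people), 40);
    println!("Person is {} bytes", std::mem::size_of::<Person>());
}
```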
And then something that pretty much everyone in this room is familiar with: the default hasher is cryptographically strong, so it can be a little slow. It's a well-known trade-off to experienced Rust programmers, but it's really surprising to brand-new Rust programmers who maybe haven't seen it before. The nice thing is there are lots of great alternative hashers, and there are good pages with Rust-specific benchmarks for them. Here's an example of using a different hasher, one that's a little bit faster — here we use an FNV hash. There's no reason to show the slow version; everyone knows what it's like to make a HashMap, so I decided to show what it's like to make a slightly faster one. And again, it depends entirely on your use; you can pick from one of the great alternatives.

Something else that can bite us is using expensive arguments in places where maybe we don't expect the code to be called. expect is a good example: you expect your code to succeed, so you don't think about the cost — but you can put something slow inside expect. For instance, this example was pulled out of one of Eliza's libraries: a format! ended up in the hot path and was being called thousands of times every second; pulling it out and doing something a little cheaper sped up the program dramatically. Again, this isn't specific to expect — it has to do with the fact that Rust is eager, and you just have to remember the evaluation order in your program.

Last, I want to talk about pre-allocating Vecs. When you look at programs by experienced Rust programmers, you see that they pre-allocate Vecs everywhere — they hardly ever allocate in the hot path. And here's an example from linkerd-tcp: we have a struct that has a buffer size and a default size, and highlighted here is us making the buffer that we use to shuttle data around from inbound to outbound.
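A sketch of those three points together. In practice you'd likely pull in the `fnv` crate rather than hand-roll the hasher as I do here; the map contents and the `id` value are made up for illustration:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Minimal FNV-1a, just to show swapping HashMap's hasher; the `fnv`
// crate provides the production version of this.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= b as u64;
            self.0 = self.0.wrapping_mul(0x100_0000_01b3); // FNV 64-bit prime
        }
    }
}

type FnvMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

fn main() {
    let mut counts: FnvMap<&str, u64> = FnvMap::default();
    counts.insert("requests", 1);
    assert_eq!(counts["requests"], 1);

    // Eager-argument pitfall: expect(&format!(...)) builds the string
    // even on success; unwrap_or_else defers the work to the failure path.
    let id = 42;
    let v: Option<u32> = Some(7);
    let n = v.unwrap_or_else(|| panic!("no value for id {}", id));
    assert_eq!(n, 7);

    // Pre-allocating a Vec: one allocation up front instead of repeated
    // grow-and-copy as the buffer fills.
    let mut buf: Vec<u8> = Vec::with_capacity(64 * 1024);
    buf.extend_from_slice(b"payload");
    assert!(buf.capacity() >= 64 * 1024);
}
```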
And now I want to talk about some of the tools that you can use. I'm going to talk about some tools on the Mac — Instruments, cargo bench, cargo-benchcmp — and also some Linux tools that we use, because our primary deployment target is Linux: perf and flame graphs, VTune, and again cargo bench and cargo-benchcmp.

So cargo bench is a micro-benchmarking tool. I'm sure many of you use it; it's part of the standard tool set, and it's really great for the high-use parts of your API, which you should consider writing micro-benchmarks for. Here's an example of a micro-benchmark we have in Tacho, where we benchmark how long it takes to make a new counter. The output will look like this — here it's mixed in with a bunch of other benchmarks that we've written, and I've highlighted 23 nanoseconds and 47 nanoseconds. Those numbers look great; everyone would probably be happy with those numbers, unless you're really paying attention.

cargo-benchcmp is something that's relatively new. It allows you to compare two cargo bench runs, which is really great for avoiding performance regressions. You know, someone ships you a PR; you look at the PR on GitHub, you try it yourself. It does a textual comparison of the output cargo bench produces, and you can see whether the change helps performance or not. In this particular example — I borrowed this graphic directly from the cargo-benchcmp GitHub page — it doesn't really make much of a difference: even though there's lots of green and red, the numbers themselves are not that different.

While micro-benchmarks are really useful and you should write them, what we've found is that writing macro-benchmarks is another really good use of time. A macro-benchmark is taking your API and using it in the same context that your users would use it.
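The real thing uses nightly's `#[bench]` attribute and `test::Bencher`; as a stable-Rust sketch of roughly what that harness does (minus its statistics and outlier handling), something like:

```rust
use std::time::Instant;

// Time `iters` runs of a closure and report the average nanoseconds per
// iteration -- roughly what test::Bencher::iter does under the hood.
fn bench<F: FnMut()>(iters: u32, mut f: F) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed().as_nanos() as f64 / f64::from(iters)
}

fn main() {
    let mut sum: u64 = 0;
    let ns = bench(100_000, || {
        // black_box keeps the optimizer from deleting the "work".
        sum = sum.wrapping_add(std::hint::black_box(1));
    });
    println!("{:.1} ns/iter", ns);
}
```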
The reason that micro-benchmarks are limited in utility is that you run something a handful of times and you measure it, so you don't really know how well it runs in the context of your CPU. In the macro-benchmark that I showed you before, we exercise the API in a loop of 10 million iterations, and we'll talk about why we do so many. When you construct a macro-benchmark like that example, you have to remember to compile with --release, because you want optimizations turned on. It's a mistake to profile code that's built without optimizations, because you might find yourself fixing something you don't care about — something the compiler would just fix for you by turning on optimizations. And one challenge is that you have to add symbols in, because if you don't have symbols, you're just going to see a bunch of addresses in your profile. Just a little bit of text in your Cargo.toml will give you your symbols.

So, again, back to constructing a macro-benchmark: Tacho, as I mentioned earlier, has two macro-benchmarks. We have a single-threaded benchmark and a multi-threaded benchmark, because these are the two different ways that people tend to use Tacho. And this, again, is the same slide I showed before of the multi-threaded macro-benchmark. What's going on is that a couple of threads are being spawned; they're calculating timings and adding them, and then we have another thread that's doing the reporting work. And then here, just as a really quick segue, is Instruments looking at the performance of this multi-threaded benchmark. Here I use a measure called IPC — instructions per cycle — and I'll talk about what that means in a second. This is really just to give you a sense of what we're going to talk about more in a moment. We didn't exactly stumble upon IPC — it's a fairly commonly known metric.
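The Cargo.toml addition mentioned above is tiny — keep release optimizations, but retain debug symbols:

```toml
# Keep symbols in release builds so profilers show function names,
# not raw addresses.
[profile.release]
debug = true
```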
We use IPC as a useful empirical metric. It tells you how many instructions are completed every clock cycle. When we learn CPU architecture in school, we tend to learn it as: you have a program, one instruction runs, then the next instruction, then the third, and so on. And that's just not how it works on modern CPUs. This is sort of the model that we learn. As you can see here, this is a serial model: you run the first instruction, then the next instruction, then the third, and in this diagram I've broken out the pipeline stages. I got this graphic and the next one from a great page called "Modern Microprocessors: A 90-Minute Guide!", which you might enjoy.

But serial execution is not really how programs run anymore. They run more like this: as soon as one fetch stage is done, the next fetch stage can start; as soon as one decode stage finishes, the next can start. And you don't just have one execution unit inside a single core anymore — now you might have three or four or five. Even more complicated, each one of these little boxes can depend on another box. So instead of having a pipeline full of work, memory stalls and locks — the things we were talking about before — can put a lot of empty space inside of your pipeline.

So, given how complicated things are, and since instructions can depend on each other and we can have huge bubbles in the pipeline, how do we know we're doing well? Here we enter the realm of the performance counter. Intel engineers had the same question — how do we know our program is running well, given how complicated a modern CPU is? — so they added something called performance monitoring counters. Just like the counters in the benchmark I showed you before, they track how often things occur, and they allow you to do things like calculate ratios. There's an enormous number of counters inside a modern Intel CPU.
There are hundreds of counters and hundreds of pages of documentation that can be really tedious to pore over. But the nice thing is that if you find a handful that you find useful, you can use them to derive metrics for your program, and instructions per cycle is one of the metrics that we use. Instructions per cycle is how many instructions a core can retire in a cycle — retire meaning it's finished, the useful work is done. And what we've found is a good rule of thumb: if you have an IPC below one, it means you're memory-stalled; if you have an IPC above one, it means you're instruction-bound — you maybe need to run fewer instructions. And you can learn this empirically for your machine. If your deployment target is a modern CPU that's, say, five-wide — or one that's three-wide — you can see how well your program runs against that. You can also learn whether 1.0 is the right threshold for you empirically by writing programs. You can write one that's memory-stalled, that does nothing but DRAM fetches and writes. You can write another one that does nothing but work entirely out of registers. And then you can determine whether this rule of thumb is appropriate or not.

Again, reminding you just how complicated a CPU is: the three-wide machine that you saw at the very beginning could have a max IPC of three, as I mentioned on that slide. Instruments on the Mac has IPC built in. Going back to the earlier slide, what I've chosen to do here is look at the IPC for the multi-threaded benchmark, and you can see that we have an IPC of less than 0.5. You might think, oh, it's 50% utilized, that's great. But this is a three- or four-wide CPU, so that's 0.5 out of four — that's maybe 10% utilized. That's really poor.
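A sketch of those two calibration workloads — hypothetical, but in the spirit described: run each under your IPC tool of choice and compare what it reports.

```rust
// Two tiny workloads for calibrating IPC on your own machine: one that is
// deliberately memory-stalled, and one that lives entirely in registers.

fn memory_stalled(len: usize, steps: usize) -> usize {
    // Dependent loads scattered through a buffer much larger than cache:
    // each load must complete before the next address is known.
    let buf: Vec<usize> = (0..len).map(|i| (i * 7919 + 1) % len).collect();
    let mut idx = 0;
    for _ in 0..steps {
        idx = buf[idx];
    }
    idx
}

fn register_bound(steps: usize) -> u64 {
    // Pure register arithmetic: no memory traffic inside the loop.
    let mut x: u64 = 1;
    for _ in 0..steps {
        x = x.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
    }
    x
}

fn main() {
    // Run one or the other under `perf stat` (or Instruments) and compare
    // the reported instructions per cycle.
    println!("{}", memory_stalled(1 << 20, 1_000_000));
    println!("{}", register_bound(1_000_000));
}
```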
And here I've done the same thing with the single-threaded macro-benchmark: the IPC is about 0.8 — quite a bit better. All these counters are available directly in Instruments. What you do is make a new counters instrument, long-click on the recording button, and pick an event — and you get a really daunting drop-down of performance counters. It's sort of assumed that you've read those hundreds of pages of documentation; really, what you're going to do is find a handful of them that work well for you. But the nice thing is you can actually create formulas from these performance counters. IPC is the only one that ships by default; you can make your own.

Here's an example in Instruments where — I can tell it's hard to read from there — I'm looking at L1 hits and L1 misses for the multi-threaded benchmark. I specifically went through the drop-down and picked L1 hits and L1 misses because I wanted to see what they looked like for this multi-threaded macro-benchmark. Here you can see we have about 20 million hits for the L1 cache and about half a million misses, which, you know, isn't great in my opinion. And then another use of Instruments is to look at your typical waterfall of methods and see how long each individual function takes, with the heaviest stack trace over there.

So Instruments is an easy-to-use performance tool. It's a little limited because it's macOS-specific, but since you have the performance counters available, it's actually much richer than a lot of people give it credit for, and I suggest you give it a try. On Linux — as I said, at Buoyant we deploy most of our code on Linux — we dig in with perf. perf has been part of Linux for years, and they've been constantly improving it; they've put a ton of work into it. It has both kernel- and user-space profiling.
It's a sampling profiler with a configurable sampling rate which you can play around with. Nyquist tells you that if you want to see something that happens at rate X, you should sample at 2X, so the nice thing about the rate being configurable is that you can really turn it up. It's also pretty cheap: it's designed to run at low overhead, so you can run it in production at a moderate sampling rate.

Now here's an example of looking at the IPC metric of our program again, this time using Linux perf, and I've highlighted that it's roughly 0.5 instructions per cycle. This is a different CPU — a dedicated machine we use for performance work — which is why the number is a little bit different. And here I use perf to look at cache hits and misses. Looking at the L1 misses and hits, you can see that for the multi-threaded benchmark I have about a 5% miss rate, which isn't great. For the single-threaded one I have a 0% miss rate, which is great (a 0% hit rate would be pretty poor).

Yeah, so I really encourage you to dig into this. If you ship software that's supposed to run on Linux, you should really dig into perf. It's much, much deeper than I have time to go into, considering that I'm talking about other tools too. You can dig into the Linux scheduler; you can dig into the I/O and network subsystems. There's an incredible amount of tooling being built around perf.

One of the things that we use with perf is flame graphs. perf, as I mentioned, is a sampling profiler, so you collect a bunch of samples and then aggregate them to get a sense of the shape of your program. And here is what one looks like for our single-threaded program. So like I said: say this is a thousand samples, aggregated.
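As a sketch of that workflow — the binary path here is hypothetical, and the stackcollapse/flamegraph scripts come from Brendan Gregg's FlameGraph repo:

```shell
# Counters for IPC and L1 behavior over one run of the benchmark
perf stat -e instructions,cycles,L1-dcache-loads,L1-dcache-load-misses \
    ./target/release/multi_thread

# Sample call stacks, then render a flame graph from the aggregate
perf record -g ./target/release/multi_thread
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
```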
What's on top is what was running on the CPU when the sample was taken, and the width of what's below it gives you a sense of how often that was seen across the samples. You can mouse over to see what each box is — I know it looks incredibly tiny; you can actually drill in by double-clicking and then mouse over to figure out exactly which function it is. So here we can see: we get a clock, we get the time, we make an Instant, we call map on a future. It gives you a sense of your program. That was the single-threaded benchmark; here's the multi-threaded benchmark, much more complex, as you would expect. You've got a couple of threads; we've got a mutex.

One thing that we do a lot is drill into some part of this — I just give things a look from time to time to make sure things are looking sane. It's really useful for looking at long-running programs. Since what we ship is a proxy that runs 24/7 in people's data centers, it's really nice to be able to get a sense of how things look at time T versus 30 minutes from now: are things changing, and as events happen inside the system, how does the shape of your running program change? Netflix really pioneered this technique, and they use it to measure the health of all their online services, all the time; they're always building flame graphs of their online services. The one tricky thing here is that you've got to remember to get symbols, just like I described earlier with profiling. You have to remember to add the symbols; otherwise, what you're going to get in the flame graph is a bunch of addresses, and good luck.

So the last tool I want to talk about is probably the most complex: VTune. Like I said, Intel engineers had the same question that you all have — how well is my program running? — so they added performance counters.
Now they had another problem: what do all these performance counters mean? So they made a commercial tool called VTune, which helps you make sense of what all these performance counters do. It's an extraordinarily complex tool, but you can get your head around it with some practice. It has tooltips, a GUI that works over SSH X forwarding, and a CLI, and if you're an open-source developer, you can get a free license.

Here, I'm looking at — hmm, this is the single-threaded benchmark; I really should have labeled these graphs. Thankfully, I've looked at them enough to know what they are. This is the single-threaded benchmark in the General Exploration view. You can see there's a fair amount of detail here: I can tell this is memory-bound — back-end bound, specifically on memory. And looking at the multi-threaded benchmark, it's also back-end bound, but something really stood out to me from running this, which was that I was bound on a remote cache. We'll come back to this, because it's a lesson I learned while preparing this talk; I hadn't realized this about this particular benchmark.

Here's a view of — I believe this is the memory explorer. This is the single-threaded benchmark, which I can tell from the miss count, and you can see stacked CPU time and then, secondarily, what's memory-bound. What's really great about this tool is that you can mouse over any cell and it'll give you a useful tooltip explaining what it means — really helpful when you're starting to use VTune; it's such a daunting tool. It'll also give you the formula it uses to derive many of these values, something like this counter minus that counter, divided by some other counter. You can actually take the formulas that you learn there and plug them into Instruments when you're doing development on your laptop. Here is the multi-threaded program.
The miss count's pretty high — that's how I can tell this is the multi... oh yeah, it also has the name at the very end. Here's the multi-threaded benchmark in the Locks and Waits view. Typically when you use VTune, it has a lot of different analysis types that you can run. You'd start by looking at the General Exploration view like I showed at the beginning, and then you'd run your benchmark over and over again under the different types of analysis. That's another great reason to write a macro-benchmark: if you have an example application, you can hand it to perf to run, you can hand it to VTune to run, and you have an easy tool that you don't have to remember how to run — no "which scripts do I need?" — you just run your macro-benchmark in the tools.

Here we can see that there's a very low wait count, and in the multi-threaded benchmark we have a very high wait count — here's seven million; we waited seven million times, and that's no good. This is another view, the event count view. You can see from the progress bar at the bottom that there's a tremendous amount of detail here. This is kind of like the day-trader, or fantasy-football, view of your performance. I wouldn't encourage you to spend a lot of time here, but as I mentioned earlier, there's a GUI and there's also CSV you can get out; this is something you can export as CSV and post-process to learn specific insights for yourself.

So the takeaway from VTune is that Intel knows their CPUs better than anyone: they know these hundreds of pages of documentation, they know how to build the formulas for the different performance counters, and they surface them in VTune. It can be really overwhelming, but the tooltips are helpful, and with all the different analysis modes, I found it really useful.
I only dug into maybe a third of the analysis modes in my screenshots here. And I learned a lesson while preparing this talk. VTune highlighted this remote-cache issue, and what I learned is that I was remote-cache bound. What this means is that this machine has two physical CPUs, and one thread was running on one CPU and one thread on the other. I didn't realize that when I was preparing the talk; the screenshot is what showed me. There was also a really great memory view — I meant to put a screenshot in here — that shows you, for all the DRAM sockets (the physical RAM, the sticks you'd plug in), the memory use, and also the memory traffic between the two CPUs, which was also just a dead giveaway that there was behavior I wasn't expecting.

So going back to perf: you can see that I had that 5% miss rate that we talked about earlier — this is the multi-threaded benchmark. What I decided to do was just use taskset to run on one CPU and see how it improved. The miss rate dropped to 0.01%, which was pretty fantastic, and the total runtime dropped from nine seconds to 3.8 seconds, which was pretty great — I was really pleased with that. And IPC — we were talking about IPC as being a useful empirical measure — increased about 10%, which tells me there's still a lot more work that we could do to improve things.

Yeah, so performance is hard to understand, given how complex CPUs are. You really need to measure it empirically: you need to run your program and see how it does. There are a lot of great tools that you can use, and IPC is a really great measurement. I've been really pleased at how many tools there are to measure the performance of programs today.
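The taskset invocation is a one-liner (the core list and binary path are machine-specific; check lscpu to see which cores share a socket):

```shell
# Pin every thread of the benchmark to cores on one physical CPU
taskset --cpu-list 0-3 ./target/release/multi_thread
```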
You can use Instruments on the Mac, which is much more powerful than people realize; you can use perf on Linux; you can use VTune. But ultimately the best tool is the one that you use on a regular basis. And I wanted to give a special thank-you to Eliza for walking me through Instruments. And thanks — thanks for listening. Thank you.