Hi everyone. My name is Evan Chan. I'm a senior data engineer at UrbanLogic, and welcome to my presentation. UrbanLogic is an online platform that provides insights; we use Rust and machine learning, among other things, to deliver really great insights for transportation, economic development, and other community use cases.

So why thin cloud apps? If we look at the progression of technology through recent years, we've moved from virtual machines in the old days to containers everywhere today, and we can see technologies on the horizon coming on fast, such as serverless and WebAssembly. What we notice in this trend is that cloud infrastructure is getting smaller, thinner, and more concurrent. So things that maybe didn't used to be quite as important, such as memory and allocations, are becoming more important, because when each unit is smaller, you need to be more efficient. We also live in a data-rich world, so we need to process more and more data, and that processing has to get more and more efficient. And finally, using less memory is more eco-friendly, right?

So the question, since we're all here to learn about Rust, is: why use Rust for thin cloud apps? These are actually the reasons I personally came to Rust. I came to Rust from writing a distributed in-memory database called FiloDB, which was on the JVM, and I came for the no-compromise aspect of Rust: the idea that you could get performance, safety, and abstractions at the same time. Usually you have to choose one or two of those, and I found it was mostly true that you could have all three. Also, with Rust you get great control over memory usage and allocations, and you even have ways to opt out of using the standard library, so there are many ways to use less memory. That small profile makes it appropriate for writing everything from apps to OSes, kernel-level code, hypervisors, and so on. However, to take advantage of it, you have to learn how to control and measure the allocations.

So let's dive right in: how Rust apps use memory. Let's start by reviewing Rust's memory model really quickly. On the left side of the slide, you see what Rust puts on the stack: primitives; structs, where you may have several different fields but everything in the struct is fixed size; fixed-size arrays; and pointers and references to things on the heap. So you have some stuff to work with there, and then everything else that is dynamic goes on the heap, such as Vecs, which are growable lists, things like Strings, and other more complex objects, as well as some miscellaneous things we'll go over later.

Now, Rust does not have a garbage collector, so how does it manage memory? This is a really important point for those of you coming from dynamic languages; it is the biggest differentiator of Rust. What Rust promises is that even though it does not have a garbage collector, it will track how your data is used through a concept known as a lifetime. It keeps track of when your data is created, when it is borrowed and used, it tries to prevent unsafe use and sharing of your data, and it tracks when your data is no longer used, so it knows when to free it. This talk is not going to be about lifetimes, but this will be important to remember when we talk about allocating memory.
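To make that concrete, here's a small sketch of my own (not from the talk's slides) showing what lands where, and where Rust frees memory without a GC:

```rust
// A fixed-size struct: lives entirely on the stack.
struct Point {
    x: f64,
    y: f64,
}

fn main() {
    let p = Point { x: 1.0, y: 2.0 }; // stack
    let fixed = [0u8; 16];            // fixed-size array: stack

    {
        // Vec and String are dynamically sized, so their contents live
        // on the heap; only the small fixed-size handle is on the stack.
        let names: Vec<String> = vec!["a".to_string(), "b".to_string()];

        // Borrowing: the compiler tracks this reference's lifetime and
        // rejects any use of it after `names` is gone.
        let first: &str = &names[0];
        println!("{first} {} {} {}", p.x, p.y, fixed.len());
    } // `names` goes out of scope here; Rust frees its heap memory
      // immediately -- no garbage collector involved.
}
```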
So with that, let's dive into some basic data structures that are used a lot, and how much memory they actually take. We start with Strings and Vecs; remember, a Vec is a growable list of items of a single type. On the stack, on 64-bit architectures like x86-64 and modern ARM machines, both use 24 bytes: the pointer is 8 bytes; then there is one field for capacity, which is how many items this container can hold (how many characters for a String, how many items for a Vec); and then there's a length, which is how many items it actually holds right now. These are both growable data structures. The pointer points to an area on the heap that actually holds the items or the characters.

Now, the complements to Vec and String are slices and string slices. Both of these are immutable. They are like the previous ones, except they are 16 bytes, because you have one pointer and one length; these things cannot grow. They just point at a location and tell you how many items are there.

Now let's look at a more complex data structure: the HashMap. This is a bit more involved. A hash map can be implemented using buckets: all of your items are hashed into a fixed number of buckets, and each bucket in turn can hold one or more items when there is a collision. So the list of buckets is basically a Vec, where each slot points to a bucket, and each bucket entry holds the key and value pair for an item. If my hash map has a String key and a String value, that means each bucket entry stores 24 bytes for the key's String struct and 24 bytes for the value's String struct, which added together gives you 48. And assuming each bucket has just one entry (the case where there are no collisions), if you add in the bucket itself, which is basically like a Vec and takes up another 24 bytes, the overhead is up to 72 bytes per entry. So one thing you need to be very, very aware of is that more complex data structures have nestings of pointers that can create a nontrivial amount of metadata. If your items are large, maybe that's not a problem. But if your items are small, you might want to be careful and think about that.
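You can check those numbers yourself with std::mem::size_of; here's a quick sketch of my own, assuming a 64-bit target:

```rust
use std::mem::size_of;

fn main() {
    // Growable containers: pointer + capacity + length = 3 x 8 bytes.
    assert_eq!(size_of::<String>(), 24);
    assert_eq!(size_of::<Vec<u8>>(), 24);

    // Slices: pointer + length = 2 x 8 bytes; no capacity, they can't grow.
    assert_eq!(size_of::<&str>(), 16);
    assert_eq!(size_of::<&[u8]>(), 16);

    // A (String, String) key-value pair, as in the HashMap example:
    // 48 bytes of handles before we even count the heap contents.
    assert_eq!(size_of::<(String, String)>(), 48);

    println!("all size assertions passed");
}
```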
Now, where could you be allocating memory in your apps? We'll go over each of these in detail later, but in no particular order: one thing that can take up a lot of memory is serialization, which creates a lot of temporary objects. Another is when you use trait objects; we'll go over that in a minute. And another is any time you call clone on data structures, and so forth. Now, you could go through your app and audit all of these uses by hand, but maybe a better way is to benchmark. There are two ways you can benchmark your apps. One is dynamic memory analysis. This looks, starting from t0 — the time when you start your app — at what is being allocated and how much, and tracks it over time. Even if you allocate memory and then free it, it figures out where you are allocating and freeing memory the most: what's the memory churn? So that's dynamic benchmarking.

There's also static memory analysis — static heap analysis. You might be more familiar with this if you come from a GC language like Java, where you can have an analyzer that walks your heap and figures out, at a given point in time, what memory is being used and what is using up the most of it. There are tools for doing dynamic analysis: one is called heaptrack, another is called DHAT, and I'll go over examples with DHAT. Static heap analysis is a bit more difficult in Rust. We can get overall memory usage pretty easily using something like jemalloc-ctl, and we can actually diff memory usage, and you can profile individual data structures using a crate called deepsize, but there isn't really anything comprehensive like you have with the JVM. Still, I'll show you some things you can use. And remember that in Rust, usually, the more times you type things like Box, the more you allocate. That's just a fun hint to keep in mind.

So now let's go over some potential uses of memory and how we can reduce them. The first thing is to look at your method signatures. What are we passing in? For example, do you see function signatures like this, where I pass in a Vec<String>? This is quite common — you need to process some list of strings, right? At first glance it might look like a nice signature, but there are two problems. When you ask the caller to pass in a Vec of Strings, you're forcing them to allocate twice: once for the Vec and once for each String. Instead, if we change the signature to take string slices — the second signature there, &[&str] — this gives the caller two chances to avoid allocating. One is that they can point at existing strings instead of allocating new Strings, which saves a whole bunch of memory. The second is that they can pass in a slice instead of a Vec. And if you want even more flexibility, you can change the signature to take an iterator, which gives them the chance to pass in even non-list data structures — anything that can provide an iterator. So that gives you flexibility and gives you a way to avoid allocations.
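Here's that signature evolution as a sketch of my own; the function names are hypothetical:

```rust
// Version 1: forces the caller to allocate a Vec plus one String per item.
fn process_v1(items: Vec<String>) -> usize {
    items.iter().map(|s| s.len()).sum()
}

// Version 2: borrows string slices; the caller can point at existing data.
fn process_v2(items: &[&str]) -> usize {
    items.iter().map(|s| s.len()).sum()
}

// Version 3: accepts any iterator of string slices -- Vecs, arrays, map
// values, even lazily generated data -- with no intermediate collection.
fn process_v3<'a>(items: impl Iterator<Item = &'a str>) -> usize {
    items.map(|s| s.len()).sum()
}

fn main() {
    let owned = vec!["hello".to_string(), "world".to_string()];
    let borrowed: Vec<&str> = owned.iter().map(|s| s.as_str()).collect();

    assert_eq!(process_v1(owned.clone()), 10); // two rounds of allocation
    assert_eq!(process_v2(&borrowed), 10);     // zero new allocations
    assert_eq!(process_v3(owned.iter().map(|s| s.as_str())), 10);
}
```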
The next thing is that we can try to flatten our data structures — things like Vec<Vec<String>>. I'm not going to go over all of these, but there are a bunch of crates that will help you there, such as nested, which will save you a lot of storage if you're keeping lots of lists of strings, for example. There's also a whole family of crates for inlinable strings, where strings below a certain size live on the stack instead of the heap, as well as things like smallvec for small lists. So there are options for smaller data structures. I did a test, using a repo that you're welcome to visit, where I show that by using nested instead of Vec<String>, you can save something like 25% of total memory allocated.

Another area is reducing clones. Many of you are writing code using async; this is a really popular feature of Rust now. You can write code that forks off work and await it, which is great. You might find yourself, however, having to clone a lot of data structures when you're calling your async functions and async closures, because the data passed into an async block, being part of a future that could run on another thread, needs to be thread-safe. Here are some quick tips. One is to consider using Arc instead of clone. This makes sense especially for things like lists, where you could be passing in a lot of items: clone will usually do a deep clone, and a deep clone has to clone every item, so that can be quite expensive. Using Arc does cost you a couple of atomic operations, but it saves you a lot of memory.

Another idea is to use something like an actor pattern. This is where you try to keep your state local instead of passing it around: you keep your data structures within each actor — or, equivalently, within each thread — and you use channels to communicate, passing only small messages and events. That's a pattern that can help, and it has other benefits as well. Finally, consider using something like Cow (copy-on-write). For example, if you want to escape strings, say for URLs, a lot of the time the string does not change, but sometimes you need to create a new copy. Instead of creating a new copy every time, you can copy only on write.

So how slow is Arc, really — in case you're worried about using Arc instead of cloning? If the data is of any real size, Arc is fast — almost always faster, actually. Basically, Arc is just an atomic increment on clone and an atomic decrement on drop. Roughly, on x86, the estimate is that this costs between 30 and 120 nanoseconds, depending on which level of cache you hit. It might be faster on other hardware.

Now here's another place you might be using memory. You might find that you have a signature like this, where you're processing some item and you want to accept different implementations of a trait, so you give your signature the dyn keyword — dyn MyTrait. Now, in order to pass that into the function, usually you need to Box it, which means you need to allocate some heap memory. Unfortunately, that means every call to this method does an allocation, which is not the fastest thing, especially in a hot loop. One crate that helps a lot here is called enum_dispatch, which is really great. If all of your trait implementations are within your control, you can make them an enum: all of your implementations of MyBehavior, in this case, are variants of a MyBehaviorEnum. You annotate it with enum_dispatch, and enum_dispatch will magically tie in with the trait and make your enum implement the trait's methods, as long as all of the enum's variants implement them too. So you can change the signature here to take MyBehaviorEnum, and you can still call the trait method on whatever variant is passed in (there's a sketch of this just below). This is a tremendous performance boost, and it reduces allocations too. It's really, really great — I love it, and I use it in one of my crates.
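Here's a minimal sketch of that pattern, assuming the enum_dispatch crate is in your Cargo.toml; the trait and implementation names are hypothetical stand-ins for the ones on the slide:

```rust
use enum_dispatch::enum_dispatch;

// Annotating the trait lets enum_dispatch generate the glue code.
#[enum_dispatch]
trait MyBehavior {
    fn process(&self) -> u32;
}

struct FastImpl;
impl MyBehavior for FastImpl {
    fn process(&self) -> u32 { 1 }
}

struct SlowImpl;
impl MyBehavior for SlowImpl {
    fn process(&self) -> u32 { 2 }
}

// The enum implements MyBehavior automatically, because every variant does.
// Calls dispatch through a match -- no Box, no vtable, no heap allocation.
#[enum_dispatch(MyBehavior)]
enum MyBehaviorEnum {
    FastImpl,
    SlowImpl,
}

// Instead of `fn handle(item: Box<dyn MyBehavior>)`:
fn handle(item: MyBehaviorEnum) -> u32 {
    item.process()
}

fn main() {
    // enum_dispatch also generates From impls for each variant.
    assert_eq!(handle(FastImpl.into()), 1);
    assert_eq!(handle(SlowImpl.into()), 2);
}
```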
So another area where we can allocate a lot is serialization. Let's look at a quick example with serde_json. We have to deserialize JSON to an intermediate value type, serde_json::Value. This is quite common for serialization libraries. Then it has to do another step: take this intermediate representation and create, say, a struct from it. One way you can go here is to use one of the faster crates, such as json-rust, where the intermediate representation is more efficient: json-rust has a short value type, where short strings live on the stack. That makes it faster and uses less memory. You can also go to binary protocols, although many of them have the same problem — they need to translate through some intermediate layer — but some of them can translate directly to, say, a struct.

However, I think the best strategy is to avoid serialization altogether: no deserialization. "What does this mean, Evan?" — that's what you'd ask. What I mean is using something like FlatBuffers — you might have heard of it — or Cap'n Proto, or Apache Arrow. It does take some work to author these formats, but what it usually means is that there is no deserialization step: once I create a flat buffer, I can send it over the wire, and when I receive it, I can examine the flat buffer directly from the network buffers and extract values without translating — without deserializing — into a final form. So this is really fast; usually you get zero-copy, or no deserialization at all. This is really, really good, and I highly recommend it.

Here's an example from processing JSON, in this case on my laptop, and again the comparison is available in this repo that I have. I used DHAT for heap profiling. What we find is that using json-rust reduces the maximum heap used — DHAT measures how much heap is in use at the point where the heap is largest during your application's runtime. And we can see that it is quite a bit faster too, maybe 30-some percent — one-third faster — because it has to allocate less. Again, this is the technique of a deserializer that uses a stack value, a short string. And if you used no deserialization at all, it would be much faster still. However, interestingly, the total number of allocations does not go down: the pattern is basically alloc-free, alloc-free, and so on.

And I think it's good to show you what the DHAT output looks like. It gives you the top allocation sites at a certain time, but you can also look at the top allocators for all time, and it will give you a stack trace. Here we can see the stack trace leads straight to serde_json — basically, creating JSON objects accounts for 90% of the allocations. And it tells you things like the average size of the allocations and their lifetimes, which here average 73 microseconds. So it gives you a lot of really useful memory-profiling information.

Okay. And just really quickly, let's talk about a few extra memory allocation topics. In Rust, you can switch memory allocators. There are two popular alternatives to the standard allocator. One is jemalloc, which originally came from BSD but was popularized by Facebook, and was created to reduce fragmentation and improve concurrency. It does have a bit of overhead in terms of memory used, but it is faster than the standard allocator.
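Swapping the global allocator is just a few lines. Here's a sketch assuming the tikv-jemallocator crate, one common way to pull jemalloc into a Rust app (the version number is illustrative):

```rust
// Cargo.toml:
// [dependencies]
// tikv-jemallocator = "0.5"

use tikv_jemallocator::Jemalloc;

// Every Box, Vec, String, etc. in the program now allocates via jemalloc.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    let v: Vec<u64> = (0..1_000).collect(); // served by jemalloc
    println!("{}", v.len());
}
```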
Another one to check out is mimalloc, from Microsoft. It is designed to be a small, secure replacement for malloc, and in practice I believe it is also faster; I have a benchmark that shows it is sometimes faster. So you can check that out.

Okay. Finally, for certain use cases, you can use bump or arena allocators. Usually this is for special cases where, say, you want to sandbox some memory for one part of your app — for queries in a database, or a certain namespace, that kind of thing. You allocate memory just by bumping a pointer, and then you can free it all at once. For this there is a crate called bumpalo, which is really great, and it can help when you want to control memory use.

Finally, you might be thinking: "Evan, reducing heap allocations is great, but I want to actually make my binary smaller." You should check out the cargo-bloat crate. It will analyze your Rust binaries and figure out where your space is being used. And there are tons of ways you can get down to really, really small binaries — well below a megabyte. By the way, the slides will be shared, and you should be able to click the URL in them: there's a blog post that covers many techniques for reducing the size of your binary, including stripping symbols, reducing debug info, and optimizing for size instead of speed. And if you really want to, you can remove the standard library; that's how you get down to extremely small, C-sized binaries. But be warned that that has a lot of tradeoffs, and I'm not sure they're always worth it — it depends on the use case.

So thank you very much. Feel free to reach out to me on Twitter, GitHub, et cetera.