Hi folks, welcome back to another Rust stream. This one is a little different from others I've done, and I've been doing some streams that are different lately. That's partially because I'm trying to find new ground to cover that's important, and partially because not everything fits into the Crust of Rust structure, where I take one topic and explain how it works. In particular, what I wanted to go over today is this: I have a crate that I take dependencies on, and it annoys me every time because it builds slowly, and I want to make it build faster. I think this is a problem a lot of people end up having with Rust code, where it doesn't build as fast as they expect coming from some other languages. Or they're not even used to compiled languages at all, and they go, oh man, this compilation thing is really slow, and they want to figure out what they can do about it. Now, fasterthanlime has a great article on this that goes through a lot of the different steps you can take, and which ones work and which ones don't. I highly recommend reading it if you haven't already. My take here is a little different from what that article goes after. My goal isn't so much to make it faster for me to build, but more: what can I do about the source of the crate that will make it faster to build for everybody? There's not always something you can do. It might be that the outcome of today's video is that we can't fix this; the compiler needs to get better, or the underlying frameworks we use need to get better, or support for procedural macros needs to get better, or something along those lines. There is the potential that the outcome of this video is no change at all. But even so, hopefully it'll be useful to see the process I go through to try to figure out where the underlying slowness is coming from. What we will probably do is apply some of the standard tricks anyway, because if a lot of your build time is taken up by, say, linking, it becomes harder to profile the time that doesn't come from the linker. So we might use some of the tricks that only make it faster for me, just to make it easier to figure out where the remainder of the time goes. The crate in particular that I'm thinking about and want to optimize is cargo itself. And I apologize for this being bright; let me go ahead and fix that for you all before you start screaming at me. Dark default, there we go. Let me make all of these dark too before we even get there. So, cargo itself. Most people are used to using cargo as a binary: you use it to build your projects, to build your binaries, to publish your code, whatever. But cargo is also released as a library, and the library API is basically permanently unstable. Every release of Rust comes with a new 0.x release of cargo, which means it's considered backwards incompatible, so it's not super easy to work against. The general guidance is that most things should be done through the stable binary interface rather than the library interface. But every now and again you really do need to drop to cargo's library interface, especially if you're trying to do things like implement a cargo subcommand, which I think I'm gonna do an impl Rust stream on at some point. For right now, it doesn't really matter what we're gonna use it for. In fact, what I'm gonna set up is basically a crate that uses cargo, but uses cargo to do basically nothing.
All I want is for that crate to build faster. Because as part of building it, it has to build cargo, and cargo has a bunch of dependencies, and cargo itself takes a while to compile, as we'll see. So I want to see whether there's anything at all we can do here. And that might mean improving some of the underlying dependencies of cargo so that they compile faster. Now, there are a bunch of things that we're not necessarily going to get deep into as part of this, but I'll talk about those as we get to them. So let's just get... oh man, my stream title is wrong. All right, let me fix that real quick, just 'cause it's gonna be annoying. Update title, amazing. "Making Rust crates compile faster", no description. Or I guess we can put in the main description. Where's my tweet? And for those looking at the recording: I know, I know. It's okay, it's okay, shh, it's okay. Great, channel title's updated. I don't know if this will actually update the stream itself, but hey, at least we tried. Okay, so let's start out by doing a cargo new, and let's make it a binary. We're gonna call it cargo-nothing. Because that's funny. All cargo subcommands are named cargo-something: when you run cargo space nothing, cargo looks in your path for a binary called cargo-nothing and runs that instead. So there's very little actual magic to it. And what we're gonna do is have cargo-nothing take a dependency on cargo. I forget what the latest cargo is. Yeah, 0.63, that makes sense. 0.63. And then I'm actually gonna do a little thing here and make a .cargo/config.toml. When you run cargo in your project, it also loads this configuration file from your current directory, every parent directory of that directory, and your home directory. It's a place where you can put configuration for cargo itself. If you put it in your home directory, it applies to any project you build with cargo. If you put it in a particular project directory, it only applies to that project. And if you put it at the appropriate place in the file system hierarchy, it affects anything under it, but nothing above. The reason I'm gonna do this is mainly because I want to set build.jobs; let me check htop, because my computer is also encoding the stream, so I don't want to take all of my cores. Let's go with something like 12. This is how many cores cargo is allowed to use for builds, so I'm gonna set jobs = 12. We're gonna put some other things in here eventually. The other thing I'm gonna do is turn on the sparse registry: sparse-registry = true, the unstable feature. The reason for this is it's just faster, and I want to help test it; there was a call for testing for this, in case you haven't seen it. It's basically a feature that changes the way cargo interacts with the crates.io registry so that instead of git-cloning the whole thing, or doing a git pull, it fetches the index directly over individual HTTP requests, and it's way, way, way faster. And of course, because that is an unstable feature, we also have to do rustup override set nightly. So now we have a nightly toolchain, and let's start out by doing a cargo build. We're gonna use the new --timings flag that stabilized in Rust 1.61, I wanna say, and just do a build, see what happens. And let's see. Oh, my computer's very busy. Very, very busy computer. Hopefully this doesn't mess up the stream too much.
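(For reference, the config file we just put together looks roughly like this. The key names are from memory, so treat it as a sketch rather than gospel:)

```toml
# cargo-nothing/.cargo/config.toml
[build]
jobs = 12                 # cap parallelism so the stream encoder keeps some cores

[unstable]
sparse-registry = true    # nightly-only: fetch the crates.io index over plain HTTP
```

And the build itself is then just cargo build --timings on the nightly toolchain.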
It shouldn't mess up the stream too much, given that I capped the build jobs, but you know, who knows. Yes, this seems to be one of the crates that's certainly slowing things down. You see it spends a lot of time compiling toml_edit. It's spending a bunch of time building cargo itself, and this is why I really want to improve the build time here. Because again, notice: we just added the dependency. We didn't add any code that actually invokes cargo. All we did was add these dependencies, and it adds so much to the build time, and that makes me sad. Come on, cargo. You can do it. Nice. Okay, so you see it saved a timing report to here, so let's see what that looks like. This file, if you haven't used it before, is a very handy tool for figuring out where your build time goes. When you run cargo with --timings, it produces this HTML file, and it gives you this waterfall diagram of what has to be built. Because remember, in order to build this nested dependency tree, cargo needs to walk the tree from the roots, from the things that don't have any dependencies, and then build going down, so that all of the dependencies of a thing are built before the thing itself. This is why at the root tier you see things like libc, which has no dependencies of its own, and further down you see other things that can't be built until those have been built. And you see most things compile very quickly: itoa, 0.2 seconds, great. Notice that this is not even building in release mode. regex-syntax is pretty early in the tree and does take a while to build, but it looks like there aren't too many things that actually depend on it. I also suspect it's seen a lot of optimization in this regard. You'll notice also that there's sort of a left and a right half to each bar. The blue part, the left part, is the time it takes to build the metadata for the crate: everything that's needed by a dependent, but not necessarily all of the optimized code and stuff. Once the blue part of the bar is done, dependents can start getting built. You can't link the final binary artifacts until the purple bar has ended as well, but you can at least start compiling the next crate, because all the type information and such is ready. (It's cargo --timings, a flag for cargo build. It stabilized in cargo itself, I think in Rust 1.61.) So we see that as we go down here, we're pushed further and further right, because these things have to wait for all the earlier things to build. This is one of the reasons why keeping a small dependency tree can help, but especially one that's not very deep: you don't have to wait as long until you can even start your build. As you see, libssh2 and libgit2 certainly end up pushing things further right here. It'd be nice if we could try to optimize some of these or reduce the depth here, but I think that's probably too large an effort to take on right now. Let's see here. serde_derive takes a while to build, and that doesn't surprise me a great amount. serde_derive is basically a giant proc macro that needs to know how to parse the Rust code it needs to generate derives from, and then has all of the code for doing those derivations. OpenSSL took longer than I would have thought, maybe because it runs bindgen. A little unclear. Certainly that's a pretty big push to the right.
And you see nothing else builds at the same time. In fact, I'm gonna scroll down a little. You get this plot at the bottom that shows where your CPU time was spent. And you see initially we're doing 100% of the work that we can; we're fully saturating the cores, and that's fantastic. But then we get to a point where there's only a very small number of dependencies that we actually need in order to build the things that come after, and they end up being a choke point, a bottleneck, for the build, where we can't do any more work because we need to wait for that thing to finish. And you see a couple of those, right? The first one over here is, it might be clap or it might be serde_derive, it's a little unclear. But even serde_derive and the OpenSSL build here are dwarfed by toml_edit and by cargo itself, right? And that's unfortunate. Notice also that generating the metadata for cargo takes a long time, and building the final artifacts at the end takes a long time. That's why this build ends up taking over a minute from scratch, even though it had plenty of cores: for a bunch of the time it can't do anything useful. So I think what we're gonna focus on this time is figuring out what we can do about toml_edit and what we can do about cargo. Is there a way to reduce the amount of work required to build these? 'Cause that would be fantastic. And you see it tells us a little bit here at the bottom, in table form, what the worst offenders are: regex-syntax, toml_edit, cargo itself, clap, and regex. So nothing terribly surprising here. This build script taking eight seconds seems unfortunate, though. It seems like it's building libssh2 from scratch, like a vendored copy, which suggests that I don't have libssh2 installed locally. If I did, that might go away. And in fact, let's go and see whether that's the case already. libssh2... I do have libssh2. Interesting. It's unclear then why this runs for so long. It sounds like it doesn't pick up my... oh, you know, I think I know why this is. If we go over here and we look at this libssh2 crate, we go to the repo. It's in the documentation, I think. I only know this because I've run into it in the past: in order to get the libssh2 crate to pick up the system-installed version of libssh2 (I'm looking at the build.rs, aha), we need to set LIBSSH2_SYS_USE_PKG_CONFIG to 1. So that's one of the things we're gonna try: the next time we build, so we can diff the output of these, we're gonna run with that set to 1 and see if that gets rid of libssh2 from there. The other thing that might hit us here is the amount of time we spend doing linking. Linking is separate from rustc; rustc is not a linker, and by default it'll just use your system linker. But there are linkers that are often a fair amount faster than that. For example, there's the mold linker, which is a fairly new development, and mold is significantly faster, especially for larger link tasks. So we can try to use mold as well and see whether that shaves a little time off. And again, this doesn't help everyone, right? This only helps my local build, but it lets us see the kinds of tricks we can play over here. So I think there's an instruction over here. Yep.
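(In config form, the two tweaks we just talked about come out to roughly the following. This is a sketch: the env var name is the one from libssh2-sys's build.rs, the [env] table is just one way of setting it, and the target triple is mine, so adjust for yours:)

```toml
# .cargo/config.toml
[env]
# Tell the libssh2-sys build script to find the system libssh2 via pkg-config
# instead of compiling the vendored copy from source.
LIBSSH2_SYS_USE_PKG_CONFIG = "1"

[target.x86_64-unknown-linux-gnu]
linker = "clang"
rustflags = ["-C", "link-arg=-fuse-ld=mold"]
```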
And so let's go back here to cargo-nothing and edit our cargo config file, setting linker = "clang". And I think these days you point at mold via -fuse-ld with a path. And let's see what happens if we build with new timings. Well, this is clearly cheating, because I'm not doing a from-scratch build, which suggests maybe it didn't pick up my rustflags. Am I misremembering? I think you can set rustflags under [build]. Yeah. That's interesting. Oh, I think I also have... let me see if I still have that file here. Oh, I have. Let me just override CARGO_TARGET_DIR there. This is working around a particular setting I have locally: I have the environment variable CARGO_TARGET_DIR set to a different directory that happens to be on a different drive. That means I don't get a target directory in the current directory holding the output artifacts of the build, which is very handy because the target directory is shared across anything I build with cargo, but it is a little annoying in cases like this where I sometimes just wanna blow away the target directory to do a from-scratch build. Like in this case, right? I want to see what the entire build looks like now that I made these changes. And yeah, there are other ways you can run things with mold, and that's fine. And I believe there's a way to check whether mold was actually used, just to double-check that we didn't do anything silly. So I'm gonna go ahead and use this command once this build finishes. So let's do this: target/debug/cargo-nothing. Nope, it built with GNU ld, which means it didn't pick up my rustflags, which is a little interesting. Why? Why, oh why, oh why? I mean, I guess I can try to do it the way they want me to. Oh, did it actually give me a warning? Unused config key: build.linker. Oh, all right, well, let's see if I do it again. It looks like it picked up the rustflags now, because see, it decided to rebuild all of the dependencies as well. It only does that if it detects that something changed that affects everything I build, such as rustflags changing. Because if rustflags change, you can't reuse any of the past artifacts, since there might be an option that changes the code generation behavior of rustc itself. Let's see, we're still waiting on cargo and toml_edit. And notice that changing the linker is, first of all, only gonna improve things for you locally, but also it's not gonna affect most of this build time. The linking only really happens in a meaningful way at the end, when we build the binary artifacts that get linked together. All the intermediate steps, things like type checking or monomorphization, all the stuff rustc does, won't be affected by this, and they'll still be as slow as they were. All right, let's do our little quick check. Why did it yell at me again? I don't think it yelled at me again. I have mold installed, right? ... I don't have mold installed. Okay, great. Good job, Jon. No wonder it couldn't. All right, let's try it again. Build. Oh, it's claiming I didn't change anything, which I suppose is true. How about now? Build, build, build. So maybe now you get some sense of why I really want cargo to build faster. Because when you're working on a subcommand, very often you don't have to go through this, right? You're not gonna do a from-scratch build.
But anytime some dependency gets updated, it has to do this whole waterfall, because cargo is the only dependency we took here. That means that if anything changes in my list of dependencies, cargo has to be rebuilt. So even with incremental builds, and I mean both rustc's incremental compilation and the fact that I've built it before and am now building it a second time, it has to rebuild cargo, and cargo is slow to rebuild. And so I have to go through this process, and that is unfortunate. Yeah, CI jobs are the same, right, where it has to build everything from scratch. Though I worry less about CI builds, because that's time I'm not actively waiting. The problem is when it's time where I'm... I don't understand. Why? All right, fine. Instead of mucking around with this, I'm gonna just do mold -run. Build, build, build. It shouldn't need the full path to mold, because -fuse-ld should accept a relative path at least. And it wasn't warning me about it. At least now you get to experience the pain of why I want to do this. Come now, cargo. Come on. You can do it. Yeah. All right, it built with mold. Fantastic. All right, let's see whether that actually made a difference. If we go back here and go to the top, which has a summary: total build time 69.9 seconds; total build time with mold and with the libssh2 fix, 68.8. Which is sort of as expected. If we go down here and look now, you'll see first that libssh2 has disappeared from this list, and that's because of that environment variable we set. By setting it, we're telling libssh2, the crate, to try to use the system libssh2 instead of building from source. This can make a huge difference if you have a native dependency in your tree. For libssh2, you have to set this environment variable; most other projects, like libgit2 or OpenSSL, will always try your system library first, so you don't generally need to do this. But if something in your dependency graph has this kind of opt-in to use the system version instead of the vendored version, that can save a decent amount of build time. In this case, it didn't. And the reason is that I have enough cores, and the libssh2 build can happen in parallel with other things, so reducing its time doesn't actually reduce the longest path, right? Imagine it had to build libssh2 at the same time as, say, serde; it just so happens the dependency graph worked out that way. If serde takes longer to build than libssh2, then even if I make libssh2 finish faster, the things after it still need to wait for serde. So I haven't actually saved any overall build time. I've saved build cycles, which is helpful if you're more core-constrained than I am, but that's all there is. And when it comes to linking, the linking should show up mainly at the end, which... zoom out. What's funny is you can see our binary here at the very end. See, that right over here is our binary. And it looks like it shaved 0.04 seconds. And again, it's because this is tiny: we're linking a tiny binary, so linking isn't really the problem. We end up saving slightly more than that overall, right? Any build script, for example, also has to be linked along the way.
So it's not just our one link; the same goes for anything that uses native libraries and actually builds them from source, those also have to be linked. So the saving isn't quite that small, but in practice, the link step is probably not your biggest concern unless you have a very hefty binary. It makes much less of a difference unless you have either a very slow linker, like an old GNU linker, or a very large binary, or a binary that transitively brings in a lot of code, something that links with a lot of things. And you see, if we compare the build graphs, they're basically the same. There's not a meaningful difference between these. The only place you can see the difference is for libssh2, which is right here around... let me zoom and adjust a little. See this peak over here, and compare it to the peak over here? See how it's missing a little bit? That's the libssh2 build we cut out. It doesn't have to be compiled anymore. So that's nice, but it doesn't really help us here, right? Okay, so now that we've discovered some of the things we can do and why they don't really help, let's try to figure out what actually goes wrong. So let me close these, and I'm gonna close that one and close cargo. Where do we wanna start here? Okay, let's start out with cargo-bloat. cargo-bloat is a tool that is pretty handy; you'll see what happens when I run it. What it produces is, in your output binary, which symbols take up the most instructions. So this isn't how much code went into the build; this is, in your final binary, where do the bytes come from? It helps tell you, if your binary is much larger than you would have expected, what the cause of that is. In our case, though, if we look at target/debug/cargo-nothing, you see it's actually a very small binary. And if you're coming from C land, you might say it's not small enough, but that's not really what I'm after here, right? This binary does nothing. It's this large mostly because of debug symbols, and also probably partially because we didn't optimize: we didn't build with --release. We can do that, a --release build, with timings even; it's just gonna be slower, and I don't wanna do that right now because I care a little less about binary size in this particular example. What I more wanted to show you was that there aren't that many symbols in here. You'll see most of the symbols are fairly tiny, like 19K for parsing out debug info, which is for stuff like printing backtraces. And again, it's because our binary doesn't do anything. It doesn't use any symbols from cargo, which means dead-code elimination takes care of just not including those in the final binary. If I change src/main.rs here and do something like: let config be a cargo Config, and I think there's also a Config::default, and there's a configure method which takes... let's do 0, false, None, false, false, false, no target dir, no flags. This is what I mean by the cargo library interface not being friendly to newcomers; there's a lot of stuff there. But if I now do a build and run cargo-bloat, you see it now starts to tell me about some cargo bits in here, and some toml_edit bits in here.
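(For those following along, the snippet I put in main.rs was along these lines. I'm reciting the cargo 0.63 API from memory here, so double-check the docs; the argument order in particular is a best-effort reconstruction:)

```rust
use cargo::Config;

fn main() {
    // Load cargo's configuration the way the cargo binary itself would:
    // config files, environment variables, and so on.
    let mut config = Config::default().expect("failed to load cargo config");

    // The long argument list is roughly: verbose, quiet, color, frozen,
    // locked, offline, target_dir, unstable_flags, cli_config.
    config
        .configure(0, false, None, false, false, false, &None, &[], &[])
        .expect("failed to configure cargo");
}
```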
But that mostly tells me what code I actually ended up generating, code that has to run and didn't get removed by dead-code elimination, which doesn't tell me why the toml_edit crate itself, or cargo itself, took a long time to build. It just tells me how much stuff from those dependencies ended up in my binary. That said, there is some interesting stuff in here. It seems like the formatting for inline tables from toml_edit is actually kind of large. Same with deserialization of cargo configs. And this makes sense, right? What we wrote constructs a cargo Config, which means reading the config files from disk and then deserializing them into an in-memory format, so that code obviously needs to make it into the binary. Finalize table, so that's also TOML parsing. hashbrown calls, that's fine. So nothing that's super interesting. But if I look at this binary now, it's 264 megabytes. Remember, it was 3.9 megabytes. And this is just because previously, Rust could get rid of everything from cargo and all of its dependencies, because we didn't actually use any of it. Whereas now that we do pull in some code from cargo, suddenly all these little tendrils of dependencies mean we pull a lot of code into our binary that wasn't needed before. You'll see there's decompression here, for example, but none of these are very large individually. The big thing here is probably gonna be debug symbols. So if I strip this binary, it's now 1.6 megabytes. strip here just removes debug symbols. This is why, when people complain about Rust binaries being large, it's usually debug symbols. It's not always, but here debug symbols made a huge difference, because all those call chains that stem from our little Config::default end up touching a lot of code. And for every piece of code that could possibly be executed, for every instruction that ends up getting generated, we need the appropriate debug symbols, which means that so many debug symbols end up getting pulled in. Of course, the downside of stripping the debug symbols is that if I now run cargo-bloat, it can't tell me anything, because it doesn't know any symbols, which is unfortunate. So let me go ahead and remove that again and then rerun the build. Does it not... oh, this is maybe interesting for some of you: cargo actually hard-links its binaries. If you look at the hard link count for this file, it's two. So removing this file and then rerunning cargo just causes cargo to recreate a hard link; it doesn't actually rebuild the file. I forget where the original is. It's not that one; it's in, I wanna say, deps, cargo-nothing. Yeah, it's that file. So if I remove that file and I remove this, then I think cargo will rebuild, or I may just have broken cargo's expectations. No, great. So now if I look at it, it's big again. So, people strip binaries because sometimes you're constrained on file size. Again, this doesn't really help us here, though. So let's try turning to a different tool. cargo-bloat looks at what's taking up space in your executable; there's another one called cargo-llvm-lines, which looks at the amount of LLVM IR. This is the intermediate representation that rustc passes to LLVM, and the tool figures out what is generating the most work for LLVM to then do code generation for. And if we go back to our build timings here, this is sort of the boundary.
It's not quite, but it's sort of the boundary here between the blue and the purple, where the purple is: here's all the stuff rustc knows about this program, handed over as LLVM IR. That's sort of a miniature version of Rust, or a simplified version of Rust, or a slightly fancy version of assembly, depending on how you wanna look at it, which rustc then gives to LLVM and says, now give me machine code for this. And that time is gonna be LLVM doing things like optimizations. A lot of time can go in there, especially if rustc ends up passing a lot of this intermediate representation to LLVM; it's giving LLVM more work to do. And so what cargo-llvm-lines does is try to figure out the origin, the source, of a lot of that IR. So if I run llvm-lines, it's gonna tell me... let's pipe it through head -n20. Well, that's interesting. A lot of LLVM IR from freeing, from dropping Configs, which is interesting. The copies column here, by the way: imagine you have a generic data structure, like Vec, which is generic over the T it contains. What ends up happening is, if you have, say, a Vec of u32s and a Vec of bools, and you're using both, rustc will generate code for each one individually, so each can be optimized according to its inner type. But that means that, taking Vec::push, you're actually gonna have two copies of Vec::push in your final binary: one for pushing a u32 into a Vec<u32>, and one for pushing a bool into a Vec<bool>. And copies here is telling us how many copies there are of this particular generic instantiation. It's interesting that drop shows up a lot here. Although one thing to keep in mind with llvm-lines, and in fact it's not that surprising if you think about it, is that this is saying: how many lines of LLVM IR is rustc giving to LLVM when building this crate? And for our crate, that's very little, right? If you look at main, it's just the IR we generate for this little bit of code, which is very, very little. So this tool is gonna be more useful to run on the dependency we think is slow than on this crate. 'Cause again, if we go back to our build timing stuff, the actual time it took to build our binary is very little, including all the codegen and linking and stuff. So what we really wanna do is run cargo-llvm-lines on cargo, or on toml_edit, to see why they generate so much work for LLVM. So we'll do that in a second. I wanna talk about one other trick first that I recently learned about. There's this unstable feature in rustc called share-generics, and I think eventually the hope is to stabilize it. What this flag to rustc does: imagine you have two crates in your dependency graph, and they both depend on hashbrown, the hash table implementation. Both of them are constructing, let's say, a hash table from String to u32. What's gonna happen with monomorphization is that Rust will monomorphize separately in each crate. Monomorphization happens in the consumer, not in the vendor, right? It's not the hashbrown crate that has to be built for every possible combination of type parameters for its types, because it can't possibly know all of the ways it might be instantiated.
So instead, when foo builds and first creates a hash map of String to bool, or when bar first creates that type, the monomorphization, the copy-paste-and-compile of that type, happens in the context of that crate. But this is kind of wasteful, right? If you have foo and bar and they both instantiate HashMap<String, bool>, why are we doing that monomorphization twice? Why are we doing codegen for it twice, once in each crate, when we could do it once and share it between them? That's the idea of share-generics: try to identify opportunities to share those monomorphizations, so we don't end up with multiple copies of instantiations of generic types and methods and functions when they use the same set of type parameters. I just wanna see whether it matters here, and the way we're gonna check is, no, we're not even gonna do that, we're gonna go here and set rustflags to -Z share-generics. And we're gonna do cargo build --timings. So that's gonna build everything again. It might make no difference, right? It only makes a difference for the types that happen to share monomorphizations, which might be very few. It might be that the way the cargo dependency graph works out, there's not actually that much sharing possible. It's also a nightly feature, so who knows how well it'll actually work in the end. But that's why we wanna try it out. The hope, of course, being that if this speeds things up, it's eventually something that lands in rustc itself, no nightly flag needed, because it should be a transparent optimization to us as consumers of the Rust compiler. We just expect it to avoid doing unnecessary work multiple times. It certainly still takes a while. I think it saved about two seconds. Okay, so build time went from 68.8 to 66.9. Hard to say how much of that is just variance, just noise, but if we scroll down, let's see if there's anything useful in here. Remember also that this gets distorted a little by the fact that builds happen in parallel. It could be that it made many things faster, but made the wrong things faster: things that were already being overshadowed by something else that took longer. So let's look specifically at toml_edit, for example. It used to take 23.26 seconds, and now it took 22.57. Cargo went from 40.3 to 39.23. Okay, so very small improvements, but improvements nonetheless. And so this seems like a worthwhile optimization; it does reduce compile times, and if it does this across the board, it adds up to seconds. But it's not a magic bullet. It doesn't just fix this problem for everything. And the reasoning for that is, it's true that when crates share monomorphizations this saves work, but that's not always the case, right? In fact, quite often they won't be the same. If I make a HashMap<String, bool> and you make a HashMap<String, u32>, there is no sharing that can happen, and so this feature effectively does nothing. And there's a cost to it too, right? More sophisticated analysis has to happen in the compiler to figure out where sharing is possible. So it might even slow some builds down; there's no free lunch here. Did it change the binary size? I'm guessing probably not.
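(To make the monomorphization point concrete, here's a tiny made-up example, nothing to do with cargo, where the same generic methods get two separate copies in the output. This duplication, when it happens across crates, is exactly what share-generics tries to collapse:)

```rust
use std::collections::HashMap;

fn main() {
    // Each distinct set of type parameters forces its own monomorphized copy
    // of HashMap::insert (and friends) into the IR that rustc hands to LLVM.
    let mut a: HashMap<String, bool> = HashMap::new();
    a.insert("pipelined".to_string(), true);

    let mut b: HashMap<String, u32> = HashMap::new();
    b.insert("jobs".to_string(), 12);

    // cargo llvm-lines would report two copies of the HashMap methods here.
    println!("{} {}", a.len(), b.len());
}
```

The flag itself went into the config as something like rustflags = ["-Z", "share-generics"] under [build], nightly only.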
Yeah, probably not, because again, remember there's very little code we actually end up pulling into our binary. Dead-code elimination is gonna get rid of most of it. So the question isn't "is there any sharing in the entire dependency graph"; it's "is there any sharing in the code that gets generated for creating a cargo Config, configuring it, and dropping it". And in practice, there's probably gonna be very little sharing there. And now I need to remove this, and deps/cargo-nothing, this one. Okay, so not very many helpful answers here so far. Most of it seems to be telling us that we need to be looking at cargo and at toml_edit, because that's where the source of the problem is, and looking at the current crate isn't gonna tell us much, because the current crate builds pretty quickly. In fact, you see that with share-generics, building our binary took longer, probably because it has to do this extra searching and the base build is so fast. Okay, so let's switch over to the cargo repository and see what we can do over there. I'm gonna go ahead and cp -r the .cargo directory from cargo-nothing over here, just so I get these same settings. And then, actually, I'm just gonna ignore most of these. I don't really care about using mold, right? It didn't make much of a difference in practice. share-generics didn't make much of a difference in practice here either. The libssh2 thing only fixes, you know, arguably a bug in the build process, and it didn't even speed up the build by that much, so I don't care about it too much. I will keep the override of CARGO_TARGET_DIR, just so I have the target directory in the current directory and can blow it away more easily. So let's then look at... I wanna do cargo build --lib. I don't actually care about building the cargo binary, because that doesn't get built when I use cargo as a dependency, and skipping it is gonna save a bunch of build time and linker time, not to mention that the cargo binaries are actually decently large. What else did I want? Let's just kick off that build first and foremost. And the reason I kick off this build is because I don't actually care about the dependencies here anymore. Notice that all the settings we changed when dealing with cargo-nothing were configuration settings for how cargo builds things, how rustc builds things, and therefore they affect the entire dependency tree. But looking at this, realistically, all of the time we wanna get rid of is spent in cargo itself, building this crate. The dependencies we can just build once, and then keep tweaking cargo itself. So I think that's what we wanna do here. toml_edit has the same property, right: we should be able to improve toml_edit as well. I do have a sneaking suspicion that what actually happens here is that toml_edit ends up generating lots of IR that cargo then ends up also emitting, probably because of things like deriving serialization and running serde's deserialize and serialize on generic types that are defined in cargo itself. And so really what we're seeing is that the IR generated around toml_edit's types is very large, and therefore building both toml_edit and cargo takes a long time. But we'll see that in a second. All right, so let's now go ahead and run llvm-lines and see what it says. Unclear why this decided it wanted to build crates-io instead.
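(The invocation, for anyone following along; cargo-llvm-lines is a separate install, and I believe it accepts the usual target-selection flags:)

```
cargo install cargo-llvm-lines
cargo llvm-lines --lib | head -n 20
```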
But yeah, see how long building just cargo takes? So the hope here, right, is that this command outputs which source function or type ends up generating the most IR for LLVM to then have to work on. And the hope is that it points us at something where maybe we can just make it not be generic, right? So imagine you have something that's fairly long, like it generates a lot of IR, and it's also generic. It might be that a bunch of the code is actually shareable between different types and doesn't need to be generic. Let's see here. There are a lot of copies of Iterator::try_fold and Option::map and Vec::extend. That's interesting. Iterator::try_fold and Option::map make a lot of sense, because they take closures, and basically every invocation involves a different closure type. So you see there are a lot of copies of them, and they account for a lot of the lines in the output. What I should have done is write the output to a file so I can walk through it instead. serde_json deserialize: there are a lot of copies of that. And deserialize is generic over the reader. Is it also generic over the T? I think so. 31 copies of that. So this one's interesting: cargo's TomlProject type. Deserializing a TomlProject, which I think is just the [project] table in Cargo.toml itself. There are only three copies of it, but it accounts for like half a percent of all of the IR that's given to LLVM in the first place. So this one is not even many copies; it's that the single implementation is very large. What else do we have down here? serde_ignored map access. Yeah, so a lot of this Vec stuff is hard to get much out of. BTree, these are inserts into BTreeMaps of various different types. So one way you could speed that up is to have more maps be the same type, but you might not even have the option to do anything about that. Yeah, so you see there are a lot of copies of these BTree types. Same with HashMap, multiple copies of HashMap being brought in. But we see even more here of the Deserialize implementations for the various cargo TOML types. And if we look at cargo::util::toml, in util/toml/mod.rs, at TomlManifest, for example, you see that it just derives Deserialize. And TomlProject, the large type here, also derives Deserialize. And so what this suggests is that the derived Deserialize for TomlProject generates a lot of code. And if you look at the type, that's not surprising, right? Look at how many fields there are. And each of these fields is itself pretty large. They all contain this MaybeWorkspace thing, which is generic over a T like String or Vec, and it in turn contains a toml Value. I wonder whether, if we got rid of this, we'd build a lot faster, because toml's Value is a fairly complex type: a Value can be one of the base types, but it can also be a table, and a table is a map from String to Value. So it ends up having to generate all this nested deserialization code. But certainly I can see how this ends up generating a lot of IR. It might be that we could write a more manual implementation of Deserialize here that ends up being more efficient, but that might be premature. Commenting this out might be really interesting. I wonder, if I go here and get rid of this, and get rid of this and TomlDependency, DetailedTomlDependency...
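(While that builds, here's the shape of the problem in miniature. This is a made-up struct, not cargo's, assuming serde with the derive feature plus the toml crate as dependencies, but it shows why a derive on a wide struct is expensive: the derive expands into visitor code that handles every field, and every nested type's deserialize gets pulled in too:)

```rust
use serde::Deserialize;

// Every field adds a field-name constant, a match arm, and a call into the
// field type's own Deserialize impl inside the generated visitor code.
#[derive(Deserialize)]
struct Project {
    name: String,
    version: String,
    authors: Option<Vec<String>>,
    description: Option<String>,
    // A free-form value drags in deserialization code for *every* TOML shape:
    // strings, numbers, arrays, and nested tables of more values.
    metadata: Option<toml::Value>,
}

fn main() {
    let p: Project = toml::from_str("name = \"demo\"\nversion = \"0.1.0\"\n").unwrap();
    println!("{} {}", p.name, p.version);
}
```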
Okay, so just to see: if I now run the no-metadata version... I don't even know if it'll build like this. Of course it didn't; error around 1560. This is a very hacky thing to do, right? I'm just trying it. There's a workspace root config here that's built from this, and it's no longer deserializable. So what if we just pass in None here? Let us do that, great. What else do we get? 2057, so custom metadata went down here. Can that be None too, maybe? And 2217, None. I just wanna check the suspicion that it's actually the Deserialize for toml's Value that ends up being very large. Because if so, it might be that the thing we actually need to do is optimize how much IR is generated for the Deserialize implementation of the TOML value type, which would be in the toml_edit crate. You can run llvm-lines with rustflags. Ooh. Hey, we have David in chat, who made cargo-llvm-lines. Let's do that, so, and this. The symbol mangling version: Rust has an old symbol mangling scheme. Symbol mangling being how you take the actual name of a function, or a type or whatever, that's generic and has instantiations with different types, or closures and whatnot, and turn it into an ASCII string that you can embed in debug symbols and the like. Rust used to have one that was kind of like C++'s, with some somewhat random ways in which it tried to shove all the Rust-looking types into that format. But that means it's a little bit lossy: there are cases where you just don't get as rich information back out. So there's now a standardized Rust symbol mangling scheme, called v0 because the previous one didn't even have a version, and it is not lossy in the same way. It includes more of that information, in a way that you can reliably get back out. And so I think what David is saying is that by doing this, you actually get every unique instantiation of the generic functions, and you get to see where they're coming from, and which types they're being instantiated with. Now, of course, I made the mistake of not timing the build without this change, so it could very well be that this was slower or faster and we won't actually know. What we will know is whether we still see the same amount of IR show up for the deserialization stuff as we did before. And if it's less IR, in general that's gonna mean shorter build times. And I mean, we can easily stash, build, retry, so that's not a great concern. Let's see what we get out the other side here. Certainly still slow to build. But again, cargo is large; it's a large project, and it's not as though we expected it to compile in a second, right? There is a lot of code, and so it should take a while to build. It's more that we're trying to figure out: should it take 40 seconds to build in debug mode? To which I feel like the answer should be no. But if the answer is no, where is that time spent, and how can we remove it? It certainly seems like the Deserialize of toml's Value, if it makes a difference, doesn't make a huge one, right? Because otherwise this build would have finished already. It's still taking a long time to build. The question is, can we reduce it by a few seconds? 'Cause if so, that's still a big win. All right, let's see what we get out the other side here. Oh yeah, we get way better info now. That's nice. All right, so: TomlManifest::to_real_manifest. So this is actually the serialization side.
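(The mangling change, in config form, is roughly this; the flag is nightly-gated, and I'm reconstructing the spelling, so verify against rustc's docs:)

```toml
# .cargo/config.toml: emit v0-mangled symbols so cargo-llvm-lines can
# attribute IR to the exact generic instantiation, types and all.
[build]
rustflags = ["-C", "symbol-mangling-version=v0"]
```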
Back in the output: deserialization of inline tables, of maps. Yeah, I think this is all the TOML deserialization machinery. So at this point it's not the code that gets generated by derive(Serialize, Deserialize); it's actually the... so, the way serde works, sort of, is that there are two parts to serialization and deserialization: the type part and the format part. The type part is a type implementing Serialize or Deserialize, and what that means is it presents a standard interface for walking that type to discover what is inside of it. When you derive Serialize and Deserialize, you don't really see this, but in general what the derive does is create this visitor thing. And a visitor is a structure that implements a particular set of serde APIs that most people don't really think about, which allow someone to say: can this type be turned into a string? Can it be created from a string? Same for integers, bools, the other primitive types. But also: can it be turned into a map? Can it be read from a map? Can it be turned into an array? Can it be made from an array? But all of that is independent of the format. That part is just about the type, and figuring out what you can construct this type from, and what you can turn this type into, in terms of other well-known types. And then there's the format side of this. This is the Serializer and the Deserializer; notice the R at the end there. Those implement formats. An example of this would be serde_json, or TOML. What the format side does is say: give me a string, or give me bytes, and I will use my knowledge of that format spec to turn it into calls to the type-based traits. So for example, when the deserializer finds a string in the input, it calls the visit-string method on the visitor for the type you said you wanted to deserialize into. When it finds a number, it calls the visit-number method. And the idea here being that if you have the implementation of all the things a type can be turned into and constructed from, and you have a thing that produces all of those input types, then you can map them together to get deserialization and serialization for any type, for any format, as long as they're compatible in terms of the base types they use. Maps being a good example: a struct in Rust can often be represented as a map, and that's how it will be represented in something like JSON. So what we're seeing here is that deserialization ends up being generic over two things: the format and the type. The type side comes from the derived Deserialize or Serialize, generated by the derive crate, and that just generates the mechanism for walking something. So again, if you derive Deserialize for a struct, what that's gonna do is walk the struct and construct it by reading keys from a map, where the names of the fields are used as the keys. That's the expectation, the contract, that it implements. And then it uses the TOML format implementation that comes from the toml_edit crate. That describes how you walk a TOML file and produce things like maps and strings and numbers, and similarly, given a type that's walkable, how you turn it into TOML constructs, ASCII constructs, such as strings and numbers and booleans and maps.
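(Here's the shape of that split, simplified from serde's actual traits; the real API has more methods and bounds, so treat this as a sketch of the structure rather than a drop-in definition:)

```rust
// The "type side": how to build Self, driven by some format-agnostic engine.
trait Deserialize<'de>: Sized {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: Deserializer<'de>;
}

// The "format side": knows the bytes and the spec (JSON, TOML, ...) and
// drives a Visitor by calling visit_* methods as it parses the input.
trait Deserializer<'de> {
    type Error;
    fn deserialize_any<V>(self, visitor: V) -> Result<V::Value, Self::Error>
    where
        V: Visitor<'de>;
    // ... deserialize_str, deserialize_map, and so on in the real trait.
}

// The visitor that a derived impl generates for your struct: one visit_*
// method per input shape the type can be built from.
trait Visitor<'de> {
    type Value;
    fn visit_str<E>(self, v: &str) -> Result<Self::Value, E>;
    // ... visit_bool, visit_i64, visit_map, and friends.
}
```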
And so what we'll see here is TomlProject, which is a type, and you see its Deserialize, right? So this is the TomlProject type's Deserialize implementation being given a toml_edit inline-map access. That's the format side of things. So this is basically "create a TomlProject from a TOML table"; that's the way to read this very complex type name. That's what it turns into. And it makes a lot of sense, right? This function we see right here is all of the code that's necessary to turn a TOML table into cargo's internal struct representation of that table, specifically for the TomlProject type. And TomlProject is this thing which has all the fields you can possibly set in the [project] section of a Cargo.toml file. And it doesn't look like removing the toml Value did a huge amount here, but let's see: if I stash that change and run this again, and save this output as with-meta, or just drop the no-meta suffix, because I wanna compare them side by side and see whether we meaningfully reduced the amount of code that gets generated. And it is true, we will have reduced it somewhat, because there's one fewer field, so by necessity there's less code to do in that parsing. The question is whether it makes a meaningful dent in the build time. And notice that even if it did, it's not clear this is a sustainable change, right? Because the metadata field is there for a reason. This was more me acting on the suspicion that maybe the toml Value type has a particularly large amount of IR generated for it, and so by not including it in this type, we're saving a bunch of IR and therefore a bunch of work for rustc and LLVM. And what this is gonna show us is basically how much we saved. I suspect the hashes are actually gonna be different. But if I look at this file, actually, maybe they're not. Let's see what happens if I diff llvm-lines.txt against the no-meta version. And I'm gonna zoom out a little here. I know you can't actually read the text; I just wanna be able to skim the difference. So you see the IR for a bunch of things didn't change, which is what we expected, right? Because even though we removed these fields from some of the structs, there are a bunch of structs that didn't change, and so we'd expect them to generate the same amount of IR, under the assumption that compilation is generally deterministic. You see some things moved places because they ended up taking less. And, I guess I can zoom in a little bit here, somewhere like here, maybe. So you'll see this first one went from 7,531 lines of IR to 7,499. So, you know, we cut it by a small percentage. And this is specifically the turning of a TomlManifest into a real manifest, because that needs to walk one fewer field. Same with the code for parsing out a TomlProject, almost certainly. Yeah, TomlProject went from 6,005 to 6,003. So again, a little less IR; not super impressive, but it does make a small difference, right? Let's see what happens if we actually do the cargo build here. So again, this is with the changes stashed. I'm just gonna open them side by side and see, in terms of the actual build time for cargo overall, whether this made a difference. I suspect it's gonna be tiny, a fraction of a percent. Someone in chat says the meta value did not seem all that big.
So, the meta value isn't large in the sense that the type itself isn't big, but that's not really what we're after. What we're after here is almost a little hidden from you: when that type says derive(Deserialize), that generates a huge function for you, with lots of code that rustc has to build. And so what we're doing by removing that field from our type is that our deserialize doesn't have to call deserialize on that type, which means its IR doesn't need to be taken into account. It might not even need to be built at all. So let me do this. Let me open that report and then do another build with timings. So this is the time it takes to build cargo itself without our meta change, and it took, what, 39.4 seconds to just build cargo. What's also interesting here is you'll see that, for the LLVM code generation, it actually ends up being able to use four cores, which is nice. Whereas this metadata generation step that we do before we hand stuff to LLVM can't even use one core fully, one code generation unit fully, which is a little sad. Although that might just be because it's one crate. All right, so this ran our new build, and it went from 39.3 seconds to 32.45. Again, could be noise. That's not at all what I wanted to copy. And in fact, we could run timings on this, but oh, that's because it just ran cargo. So if I run this, I'm going to have to do it again now. The other one compiled the dependencies as well, so it's an unfair comparison. So I stash again, and I'm going to run the build again. So it's not actually a seven-second difference. Would merging serde into std help? Merging serde into std would not make a difference here, because at least some of where this time is going is the walking code for the TomlProject type that gets generated, and the formatting code that handles the TOML spec, the TOML format. Those both have a bunch of IR, and that's specific to the TomlProject type, which is in cargo, and to the TOML code, which is in toml_edit. And neither of those is about serde itself. serde you can think of more as the interface between the two. Okay, so removing that field in fact made the build slower, which basically means this is just noise. In practice, it had no actual difference. That's interesting. So that suggests indeed that there's not actually that much IR in meta, or it might mean that there are still other places where toml's Value ends up getting deserialized, and so we still have to compile the code for it. In fact, that code... I'll talk about that in a second. Yeah, the more I think about it, the more this makes sense, because in the deserialize code for TomlProject, all it means to have this field is that we're gonna call deserialize on this Value type. It's just a function call within that code, so it doesn't generate that much IR there. The IR comes from the fact that toml Value's implementation of Deserialize still has to be compiled, because it's still probably being used somewhere. And so that IR still gets generated; it's just separate from the implementation of this. So I think in practice, removing this is unlikely to make a difference. Can you replace the Serialize and Deserialize implementations with a dummy? Oh, certainly. Let's do... I don't think we even need Serialize. So for TomlProject, let's write an impl of serde's Deserialize for TomlProject by hand, a dummy one.
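(Something like this; hypothetical, since I'm paraphrasing what went on screen, and the whole point is that the body does nothing:)

```rust
// Stand-in for the derived impl: callers still type-check, but serde
// generates none of the field-walking code. Purely for measurement.
impl<'de> serde::Deserialize<'de> for TomlProject {
    fn deserialize<D>(_deserializer: D) -> Result<Self, D::Error>
    where
        D: serde::Deserializer<'de>,
    {
        unimplemented!("stubbed out while measuring compile time")
    }
}
```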
See what happens if we do that. That doesn't work as-is, because cargo really doesn't like it if we have warnings. (IR is the intermediate representation, by the way, which is basically the language that rustc uses to talk to LLVM. So it's not Rust; in fact, it's unrelated to Rust. It's a language that LLVM accepts as input into its code generation pipeline. It's sort of kind of like assembly, but a little higher level; it allows more annotations that LLVM can use for things like optimization later.) Okay, so that did cut the build time, not by much, but by a second. And in fact, if we run llvm-lines again... in fact, let's just look at the llvm-lines output we had last time. So TomlProject was one. TomlManifest is another. So let's do the same thing for TomlManifest. And the reason we're doing this is mostly to convince ourselves that this is the reason why, or rather, to see how much we could possibly improve on this. And I think there was one more in there that showed up, which was the inheritable fields type and the intermediate dependency type. So let's go ahead and find those and see what that does. Right. And that means these fields have to go away, because these serde attributes are basically a part of the derive macro. Because we removed derive(Deserialize), they won't be parsed anymore, although they would be picked up by Serialize. So it's only for this one, which only derived Deserialize and did not derive Serialize, that there's now nothing to pick up this attribute. And same with this one. And I probably missed one more. 296, no, 269. Right. That's the intermediate dependency type, which is actually generic, so our impl needs to be generic over the same parameter, with the same bound. No, in fact, let's keep it simple for now; that's fine. See what that gives us. And what this will do is basically give us an upper bound on how much time we could save in the build if we spent lots of time hyper-optimizing this. And, you know, okay, so this was getting rid of TomlManifest, TomlProject, the inheritable fields, and the intermediate dependency. And you see it cut build time from 32 seconds to 29.8 seconds. That's a decent cut, but there's still a bunch of time being spent somewhere. So let's go ahead and see what remains in this no-op version. And it might be that at this point it's no longer deserialization code, right? Certainly it seems like deserialization was a decent part of the IR: we cut three seconds, almost 10%, off the build time. So that's pretty good, but it doesn't mean cargo is fast to build. There's still gonna be something left that we wanna explore. And we could probably have noticed this if we looked at the old llvm-lines output, right? Because if you add up all of these together, so 0.2, 0.4, 0.6, 0.7, 0.8, 0.9, 1%, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, we don't quite get to 10%. So it seems to have an outsized impact, because less code means less work for rustc and less work for LLVM; it's not just a direct translation from the percentage of IR. All right, so let's look at this no-op build and see what's left now. So there's still deserialization for... so this is serde_json for the diagnostic server. That's interesting. And this deserializing of CliUnstable. And the reason for that is because, if we look at, that's under features, I wanna say src/cargo/core/features.rs, CliUnstable. Let's see if I'm telling the truth.
Yeah, so this CliUnstable struct is constructed by this macro, and this macro basically creates a field for each one of these. So again, the IR is quite simply a function of how many fields there are, because the larger the struct is, the more IR that serde needs to generate for this type in order for it to derive Serialize, because the derived implementation basically needs to walk every field. So it's a pretty direct translation. But then you see now we start getting into things like cargo clean generating some IR, but none of these are a lot of IR, which sort of suggests that, okay, some of this is we're giving LLVM a lot of work to do. If we look back at this, one thing you'll see, I think, if I scroll down enough here, is that when we cut down this time, a decent amount of the time we cut is actually from LLVM codegen. It's not all from LLVM codegen, but we're giving LLVM less work to do, and therefore the LLVM part takes less time. But the cargo stuff actually takes more time now, which is a little interesting. It goes from what, maybe let's say five seconds, to closer to seven. But LLVM takes a lot less. So reducing the IR here has helped LLVM a lot, but it hasn't made the rustc part any faster, which sort of makes sense, right? It suggests that the time that rustc is spending is on things like type checking, which LLVM doesn't care about. And by removing these serde implementations, we've removed a lot of sort of trivial Rust code that rustc probably went through pretty quickly anyway. You can imagine a trait bound or something that is really complicated for rustc to compute, but once it's computed and type checked it, it doesn't generate basically any IR for LLVM. And so the purple would be very small but the blue would be very large. And what we did here by removing the very large chunks of LLVM IR is we gave LLVM less work to do, but we don't actually know where rustc is spending its time, which is what we need in order to cut that blue bar. And again, there's still a lot of work here that LLVM is clearly doing that we could probably optimize too, but let's switch gears a little bit and look at what we can do about that blue part, about the Rust part of compilation and not the codegen part of compilation. One of the reasons why it's important to focus on that blue part, even though the purple part is large, is remember in that large build tree that we have in that waterfall diagram in the very beginning, remember I said that crates that depend on you can't start building until the blue part is done, right? So for example, cargo depends on toml_edit. Cargo couldn't start building until the blue part of toml_edit finished. And the reason for this, right, is because to build cargo, you don't need the codegen output from LLVM, because you're not linking yet, but you do need toml_edit to have gone through things like type checking, because you need to know what types are available so that you can do type checking of cargo. So that's the distinction here to think about. And so I think that's the reason why we really want to shrink that blue part, because it allows more things to start sooner. Imagine that the blue part of toml_edit was like zero. Then cargo would get to start, I don't know exactly how long this is, but that's about what, eight seconds? Cargo would get to start building eight seconds sooner, which is gonna cut the whole build time by eight seconds, because it gets to do more things in parallel back here.
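To make that field-walking point from the CliUnstable discussion concrete, here's roughly the shape of what a derive expands to. This is a hand-written sketch with made-up fields, not serde_derive's actual output, but it shows why the generated IR grows with every field you add:

```rust
use serde::ser::{Serialize, SerializeStruct, Serializer};

// Hypothetical struct; imagine one field per unstable CLI flag.
struct CliUnstable {
    build_std: bool,
    sparse_registry: bool,
    timings: bool,
}

// Roughly what derive(Serialize) generates: one serialize_field call
// per field, so the function body (and its IR) scales with field count.
impl Serialize for CliUnstable {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        let mut s = serializer.serialize_struct("CliUnstable", 3)?;
        s.serialize_field("build_std", &self.build_std)?;
        s.serialize_field("sparse_registry", &self.sparse_registry)?;
        s.serialize_field("timings", &self.timings)?;
        s.end()
    }
}
```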
So then the question becomes, okay, we've looked at what we can do about the purple part, which is generate less IR. That's the main part of it. And cargo llvm-lines is a great tool for doing that. What do we do about the blue part? So for the blue part, let's see, I talked about this, I've talked about this, talked about this. I'm gonna close that. So for the blue part, there's a different mechanism you can use. And that is rustc itself has a nightly flag called -Z self-profile. And what that does is it runs the compiler in a mode where it records where it's spending its time. And it records it in such a way that you can then open it up in visualizers of various kinds to look at it in a graphical way. So let's go look at that. And I'm gonna skip ahead a little bit and go over to Amos's great article and steal this. So -Z self-profile is useful in and of itself, but we want some additional information, and this additional information is basically: don't just record which function you're running. Like, inside of the compiler it has, like, a function that does monomorphization, and if it told us, I spend all my time in monomorphization, that wouldn't really tell us very much. We wanna know what it is monomorphizing. And so what this additional flag does is it says, when you're self-profiling, then in addition to the default events, also include the arguments for every function. And so what this is gonna enable us to do is figure out which call to monomorphize you spent your time in. And you'll notice I run cargo rustc here, and that's because I wanna build the current crate with these additional -Z flags, but I don't need to add them to the RUSTFLAGS environment variable, because that would build my entire dependency graph with self-profile, which I don't care about. I just want it for this invocation. So let's, I'm on the nightly channel of cargo, but this is the beta channel. Oh no, I didn't, did I? So it's gonna have to compile all the dependencies again. What? Oh right. The -- is needed here because these are arguments to rustc. They're not arguments to cargo rustc. And the distinction is kind of annoying, but you can imagine that the cargo rustc subcommand takes arguments, and those arguments are not necessarily the same as the ones that rustc itself, as the command-line tool rustc, accepts. And this -- lets us separate the things that cargo rustc should interpret as its arguments from the ones it should pass on. And this is gonna have to build the dependency graph again, and that's fine. I just wanna build --lib. So that's an example. --lib here is an argument to cargo rustc, it's not an argument to rustc. And so that's why this -- is needed to separate the two. Look how many times we've compiled the dependency graph of cargo. So someone asked in chat, is the build time a function of the number of IR lines generated? So, sort of, right? The more IR you give LLVM to do codegen on, the longer it's gonna take LLVM to do that codegen, because every IR line is gonna turn into somewhere between, you know, zero and N lines of actual machine code. Some might be optimized away, right? But that does mean that every line has to be evaluated by LLVM, and that means work for LLVM itself. But that's not where all of the time goes. So again, for example, rustc has to do a bunch of work that's independent of codegen, and that's what we're gonna look at now. So, okay, it finished.
And if I now look in the current directory, you'll see I have this file called cargo dash bunch-of-numbers dot mm_profdata. That's the file that includes all the profiling information that rustc just generated as part of building cargo. You'll see it's pretty large, because it holds all of this information about what rustc was doing at every step. This file is in a format that can be a little annoying to parse out. And the mm here stands for measureme, which is a tool that was built specifically for rustc's self-profiling feature. You can't see this because I think it's behind my face, but it says under the about: support crate for rustc's self-profiling feature. Measureme comes with a bunch of tools. You just do, like, cargo, it's not quite cargo install these, but it's like cargo install --git, path to it, --branch, branch, and then the name of the tool. There are a couple of different tools that come here. There's summarize, which gives you a table of where rustc spent its time. There's stack_collapse and flamegraph, which let you produce flame graphs of where most of the time was going. And then there's crox. And crox generates a file that you can then open in a profiling tool like the Chrome profiler to see, like, a timeline of where all the time went. And that's the one we want to use in this case, because it gives us the nicest visualization. I happened to have prepared some of this earlier. So I have crox installed and I'm just going to give it its file. I don't know. I hope it doesn't spit this to standard out. It did not. So it gave me a chrome_profiler.json file, which you see is 1.9 gigabytes large. I might regret this. Let's see whether I will regret this. Whether, for example, the stream ends up slow because I'm uploading too much at once. Open. Well, that's good. I heard a few years ago that rustc gives very bad IR to LLVM, which is why it's so slow. Do you know if that still is or ever was true? I haven't heard that. It might be true, but I don't know. Are these tools based off Inferno? Yeah, the flame graph tooling there is based off Inferno. The trace processor native binary is an accelerator? All right, sure. What's the worst that could happen? I'm just going to run some random scripts off of the internet. No one's ever been bitten by that before. Ah, see, it says the browser memory limit is two gigabytes per tab, but the file was 1.9 gigabytes. It totally is larger in memory. Does that mean it finished loading the trace? Do I have enough memory to do this? Is everything going to fall apart? I think I'm okay. Well, it's loading. I wonder how long this will take to load. There is, nice. Crox does have a mechanism for dropping events that are shorter than a certain amount of time, which removes a lot of noise from this and would make the file significantly smaller, but we don't want to remove things. Trace loaded in 47 seconds, nice. Reload the UI and click yes. Yes. All right, let's see what we get. Computing Android startup metric. That doesn't sound like what I want. I think I saw a jank in there loading overview. This is going to crash my computer. Okay, so here's what we get: a visualization. This is a timeline visualization saying, the left-hand side is when did rustc get invoked, and then this is a time diagram showing the entire time until rustc returned. And it's not quite a stack trace, because these are events emitted by rustc itself.
So this is when rustc decided to emit an event, as opposed to every possible stack frame, which would be a lot more noise. And you'll see at the top here, we have fairly nicely distinguished phases of compilation. So first we have whatever's over here, configure and expand. Then we have analysis, which is type checking and borrow checking. Then we have generate crate metadata, which is, it looks like, exporting symbols and things like monomorphization. Then there's code generation. Then, oh, then there's the self-profiling itself. So self-profile enables a mode where rustc has to do a lot more work in order to do the profiling. So this is showing us that it spent a bunch of time doing self-profiling, and this is linking. And you see linking is a fairly short amount of time here. Can you open this in Chromium DevTools proper? You can, but if you open the Chrome DevTools performance page, it now tells you to use Perfetto instead. Okay, so we're gonna ignore linking. We can ignore allocating query strings. If you see further down here, you see it also shows, like, it's using multiple threads to do code generation and stuff, and it's just showing you what all those other threads are doing. But that's all as part of code generation. And I suppose we could zoom in here a little bit on the code generation, maybe. Well, let me do that. Yeah. And here this is all code generation for different crates. Like, which crate is this? I wonder. Arguments, well, that's unhelpful, arguments. So this is all the codegen stuff, but that's the purple part, right? So code generation is not actually what we care about here. Instead, I think the bits that we actually want are, why will it not let me scroll? Are analysis and generating the metadata. Like, that's where things get interesting. Why can I not scroll left and right? Which is what I really want to do. If I zoom in and out, it zooms in and out of the center. Ah, there we go. I cheated. Amazing. Oh yeah, so this actually ties back to another point that I should have raised earlier, which is one of the ways to improve your build time is to make sure there are fewer things that build before you. And one of the great ways to do that is to opt out of default feature flags and only enable the ones you need. Because in general, feature flags tend to enable more dependencies, including transitive dependencies. And so if you remove them, they don't have to be built before you, and therefore you might be built in parallel with them instead, or they might not need to be built at all. So definitely reducing your use of feature flags, or rather reducing which feature flags and thus which optional dependencies you actually end up taking, can improve your build time a lot. I'm sure that applies to cargo too. There are almost certainly some of the dependencies here that we could eliminate. And that might be worth digging into. Inlining the toml_edit dependency? That might help. I think it would be really interesting to do the same analysis on the toml_edit crate to figure out where its time is going. My guess is, I still think that they're the same. I still think that what ends up making the blue part of cargo long is in big part the thing that makes the toml_edit part long. And I'm guessing it has to do with monomorphization. I'm guessing that they both end up doing monomorphization over some type from toml_edit. Because remember, monomorphization happens in the crate that names the type, basically.
And shared generics wouldn't help here if they're naming two different types. But imagine that there's some function in toml_edit, let's say the most obvious example being to_string. So imagine that toml_edit has a to_string function and it's generic over which type it should turn into a TOML string. And the toml_edit crate calls that function with one of its types, let's say a TOML value. That means it has to type check all of the, like, serialization code for to_string for that TOML value type. And that's probably a bunch of code, which means a bunch of work. In the cargo crate, it calls to_string with some cargo type, maybe TomlProject. That means it's gonna generate a monomorphization of to_string. And that monomorphization is gonna be a bunch of code that has to be type checked. And they're two copies, they're not shareable, because they're not using the same type. They both end up generating lots of code that needs to be parsed out. And this actually gets back to one really neat trick for trying to reduce the cost of monomorphization in this way, which is non-generic inner functions. Let me see if I can find that, inner functions. So I'll put this in chat. This came up back in 2020. And it's a great article on, if you have a function where the function has to be generic, but a bunch of the code in the function does not actually care about those generics. One example being, if you want to have a nice outer interface, so you want to be generic over anything that can be turned into a path, but the body of your function only really operates on a path reference anyway. Then what you can do is you can have this inner function that is not generic that holds most of the body of your function, and then you have the generic part call the non-generic inner function. And that way, there's only gonna be one copy of this function everywhere. Not just in your crate, but in any dependent crate too, because this function is not generic. So only a single copy is generated, and that one is reused everywhere. And there are gonna be lots of copies of the generic shell potentially, both in your crate and in consuming crates, but they will all be really short, because realistically, they're just gonna do the sort of generic part and then they're gonna call the non-generic function, and all of them share the same non-generic function. And therefore, you only end up doing the sort of expensive work once. And the non-expensive generic parts you do many times, but they're all different. And this is something that I suspect toml_edit could benefit a lot from. So I'm guessing that they have a bunch of functions that are generic over either the type that implements Serialize or Deserialize, or, when you implement the Serializer and Deserializer traits, you're often generic over the read type or the write type, so the underlying thing you're gonna read from or write to, such as a file or a byte string or whatnot. And even though you're generic over both of them, chances are most of the actual code for producing and parsing TOML doesn't need to be generic over those things. But if it isn't put in a non-generic inner function or some other non-generic helper function, then all of that code is gonna be duplicated for each monomorphization, which means a lot of potentially wasted effort. And so we might be able to do some of that optimization. We might not actually do it today. I'm just pointing it out as one of the ways you can help reduce the compile time from this kind of generic stuff. Okay, so let's go back to this trace.
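Here's a minimal sketch of the non-generic inner function trick, using the read-a-file example; the names mirror std's read_to_string, but this is just an illustration:

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

// The public, generic shell: convenient to call with &str, String,
// PathBuf, and so on. One tiny copy of this gets monomorphized per
// concrete P that callers use.
pub fn read_to_string<P: AsRef<Path>>(path: P) -> io::Result<String> {
    inner(path.as_ref())
}

// The non-generic inner function holds the actual work, so rustc type
// checks it once and LLVM compiles it once, no matter how many distinct
// types callers pass to the outer function.
fn inner(path: &Path) -> io::Result<String> {
    let mut file = File::open(path)?;
    let mut contents = String::new();
    file.read_to_string(&mut contents)?;
    Ok(contents)
}
```

Every instantiation of the outer function is just a call into inner, so the expensive body exists exactly once, even across dependent crates.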
So what do we have here? For analysis, we have type checking, and type checking has collecting types, coherence checking, WF checking. WF. Whole file. Who knows? Item types checking. It's interesting that a lot of this time is spent in ops::common_for_install_and_uninstall. Right, so you see here for check mod type, and also for this one, that's the primary argument. cargo ops, cargo add, manifest. Maybe cargo add. I'm just, like, looking for the big blocks. Okay, item bodies checking. So these things are looking at the sort of outer types. So this is function signatures and type signatures and stuff. And this is checking the bodies of methods to see that they type check internally, but also that they match the signature of the function. Oh, well-formedness. That makes sense. That is a better interpretation of WF. So, type check item bodies. Let's see, let's zoom into this a little bit. See what we can find. So this seems like it has lots and lots of small things in it. There's nothing here that strikes me as, like, a particularly wide frame. Although, you know, if you squint, like, this one is kind of wide, parse_links_overrides, this one is wide. So that's cargo::util::toml, because of course it is. to_real_manifest. Let's look at this, cargo::util::toml, to_real_manifest. So to_real_manifest is just a very long function. That's why it shows up so much. It has a very, very, very long function body, and therefore, you know, it's a bunch of work to deal with. Okay, but it doesn't look like type checking is the big offender here. You know, there are lots of functions in cargo that need to be type checked, but there's none that stand out as being particularly bad. So let's switch over to borrow checking. And again, this sort of ties back to how we talked about, you know, cargo is a big project, so we expect there to be a bunch of work. What we're looking for is particular offenders that we think we could improve on. So what do we have in here? Still nothing that looks very large. This one maybe. Hey, look, it's cargo::util::toml, to_real_manifest again. So borrow checking that function took 38 milliseconds. Right, so again, the time here is very tiny. But of course, because there are lots of things that are tiny, they add up. What's also interesting is this appears to all be single threaded. You know, the code generation is multi-threaded, but the analysis and generating the metadata all appears to be happening on one thread, you know, on this thread two. The other threads look to just do codegen and LLVM. So certainly one way to optimize this would be trying to make that happen in parallel, although I'm sure that's easier said than done. No, that's not what I wanted to do. All right, so generating crate metadata. What do we have over here? Come back. optimized_mir, optimized_mir. These all again look like, you know, a small amount of work for every item. Nothing that pops out very much. Exported symbols, collect and partition mono items, monomorphization collector graph walk. What do we have? Do we have anything here that looks very large? No, nothing that looks very large here either. Although what I wish I could do here is look for things that are kind of similar. Right, like, even that wouldn't really, well, that one looks large. What do we got over here? What is this big one? Want to zoom in more over here and see what we can do. Yeah, it doesn't seem like there are any that really stand out.
But one of the things that's challenging here, right, is that the arguments here are all concrete. And so there's no distinction between types that are the same except for an inner type. Right, so an example here would be deserialize. Every implementation of deserialize might be small, but it might be that the deserialize for any particular thing, or rather that collectively deserialize, is the biggest offender. But this wouldn't tell us that, because we can't at a glance tell that, you know, this range of tiny dots are all related to deserialization. I do wonder what the color coding here is, though. Doesn't look like it's related to the arguments. What would be nice, actually, is if the colors were, like, a hash of each prefix of the type or something. What's that thing over there? HashMap::new. Okay, that one's pretty large. But nothing that really stands out. What about over here in partitioning? CGU partitioning, assert symbols are distinct. I wonder why that takes so long. So here's also where, as I mentioned earlier, the tools here aren't great. You know, as you see, some of the information is here. In fact, potentially most of the information is here. The problem is it's hard to generalize from this. And you'll notice that the reason why we get all these tiny slivers here is because these are many calls to the same thing but with different arguments. And if we hadn't included the self-profile flag that says include arguments, well, we would get bigger boxes, but we wouldn't really know what the boxes were. We can try it. In fact, I think that's what the colors here are. If we go back to the collector, for example. No, don't, that's not what I want. I think the colors here are the function. So this is optimized_mir. And this is the same color, so it's the same function, whereas green is resolve_instance. So squinting at this, you know, it seems like there's about the same amount of orange and green, but they're, like, interleaved a lot. And this is where a flame graph might actually be helpful. The challenge is it's really just gonna tell us the time was spent in optimized_mir, like, a bunch of time was spent in optimized_mir, and well, maybe it'll tell us the biggest offender within each one. Let's try it. I mean, it doesn't hurt to try. So let's go back to this, and this is one thing that really makes me sad about fish, is that I don't immediately get my completions. So, flamegraph of cargo's mm_profdata. This might generate a very large file. flamegraph, help. So now we should have a, is there, yeah, rustc.svg. Let's see how big that file is, rustc.svg. Well, it's not very helpful. Thanks. That's great, thank you. Oh yeah, you're probably right. Maybe I can, oh you're right, I can just draw here. What am I doing? Good call. This file clearly removes all the information that we care about. So let's try to bring some of that back if we can. We might not be able to, which is sad. This is because, by default, flamegraph will remove things that are particularly small. So a lot of the smaller frames here end up going away. Although I guess what I really wanna do is go over here to generate crate metadata. You see some of these end up being so small that we don't really get much to go on. Like for optimized_mir, you know, it just tells us that's where time was spent. Same with here, you know, this ends up being a non-time view of this one. So you know, here, let's take monomorphization collector graph walk.
If we go over here, it will tell us, so what was hard to tell over there was, like, how much is green versus how much is orange? And here it'll tell us, you know, you spent this much time in optimized_mir, you spent this much time in resolve_instance, but that doesn't really help us, right? Mir drops elaborated. It does tell us, you know, which subcalls, but it doesn't tell us which items. And part of the reason for that is each item is so small that you wouldn't easily figure out what the top one is; there's no obvious contender here, because every instantiation of every type is considered different. So I think realistically, what this is telling us is, we would want some kind of pass that we could do on the file that sort of turns types back into their generic form. Like, we might have, you know, Deserialize of std::io::File and Deserialize of Vec<u8>, and I want to turn them into Deserialize of R so that they're combined in the output. And that way I should be more easily able to look at, okay, what are the general problem areas? Because this is relatively unhelpful. And unfortunately, there's also no easy way to do the sort of reverse graph, which is almost what we want here, which is: show me what is the slowest type. And I want the slowest type across all of the passes that are run on it. That would be a really useful metric, but there's no real way to extract that information here. You know, if I go to the bottom, or if I go to here, you know, this is talking about the type. That's a bad example maybe. Let's take this one, right? So this is a Vec from an iterator over package IDs with a filter. I want that type, tell me about the time spent across all passes on that type, because that should show which type is taking up collectively the most amount of time across all passes. Yeah, so I think what you'd want here is, like, you wanna produce all the same data with all the arguments, and then have a post-processing step that sort of reduces or combines types that are sort of similar, or very similar, or things that you know you don't care about, like closures maybe, and just removes those types so that those strings compare the same, then collapse them, and then emit the flame graph or this thing. And what you should get is larger chunks, where each chunk, sort of, you know, is lossy. It combines information from multiple actual concrete types, but it might be able to direct you more in terms of, you know, which type, which monomorphization should I be looking at? Yeah, so arguably one thing you could look at here, right, is go back to the old symbol mangling scheme, because that one was lossy. It wouldn't include all of the information necessary to reconstruct a type. And so that actually means it does some of that work for you, but in a relatively unpredictable way. And so it's a little unfortunate, because I don't know that we're gonna get a lot further looking at just this data. And it's sad, because that is sort of the extent of the tooling that we have today. There's not a lot more we can really do here to try to grab more data out of this. I wonder if there's a metrics, well, this is unhelpful. Oh, I guess this is very, like, Android focused. Query with SQL. That's weird. Yeah, cause it's hard to tell from this, you know, is it just cargo is large? Or is there some particular offender here that we just can't easily spot from this?
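To sketch what that post-processing step could look like, here's a toy function that collapses the concrete generic arguments in an event label, so that, say, Deserialize for two different concrete types would aggregate under one key. This is a hypothetical helper, not an existing measureme tool:

```rust
// Collapse everything inside the outermost <...> so that, e.g.,
// "deserialize::<std::io::File>" and "deserialize::<Vec<u8>>" both
// become "deserialize::<_>" and can be summed together.
fn collapse_generics(label: &str) -> String {
    let mut out = String::new();
    let mut depth = 0usize;
    for c in label.chars() {
        match c {
            '<' => {
                depth += 1;
                if depth == 1 {
                    out.push_str("<_>");
                }
            }
            '>' => depth = depth.saturating_sub(1),
            _ if depth == 0 => out.push(c),
            _ => {} // drop the concrete type arguments
        }
    }
    out
}

fn main() {
    assert_eq!(collapse_generics("deserialize::<Vec<u8>>"), "deserialize::<_>");
    assert_eq!(collapse_generics("deserialize::<std::io::File>"), "deserialize::<_>");
}
```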
One option, right, is to do random sampling, which is basically what I'm doing now, which is take the whole thing, just click random small things and see if you see patterns, like things that are often there. But they all seem fairly distinct, which is interesting. Crate dependency. Yeah, I don't see any, like, big obvious offenders here, which is kind of fascinating. The other way we could go about this is we could stop this one, and we could do cargo rustc, lib, verbose, and then we could take this command and we could run it under perf record. Why did it fail? Is it trying to tell me that I broke the code somewhere? Oh, it's expecting to get something from the build script. This is very hard to parse through, but see how there's a CARGO_PKG_VERSION_PATCH in here? So if I do env CARGO_PKG_VERSION_PATCH=0, cargo package version patch. Oh, and it wants CARGO_PKG_VERSION_MAJOR=0. And CARGO_PKG_VERSION_MINOR=63. Nice, let's see what we get here. So what we get out of that is a perf data file. And then what I can do is perf script, and let's see what happens. So what I'm doing here is, rather than using the rustc profiling, which the developers working on rustc have done careful work on to make sure it emits the right metrics and whatnot, I'm just saying, I just wanna see where rustc the process is spending its time, and just sample it at random times, which is what perf record does, and then I want a flame graph of where it spent its time. And I should have stored the perf script output into a file, but I did not, so I'll run the full pipeline into perf.svg. See what we got here. Right, so this is much noisier, right? Because it's not this neat, these-are-the-metrics-that-we-collected view. But over here, we have collector collect items. So what's over here? rustc_interface passes analysis. You know, this is much noisier data, but let's see, rustc privacy analysis. Okay, so privacy analysis, where is it going? It's visiting expressions. So this is where it'd be nice to have the arguments to the function, but it looks like a decent amount of time is spent in just analyzing the visibility of different types. If we go back up here, we have encode crate root, encode metadata, monomorphization. So this is more useful if we wanted to optimize rustc itself. If we wanted to say, okay, I want the, you know, encode metadata implementation function to be faster, where is it spending its time across all possible types? You know, it's spending a bunch of its time in whatever this closure is. So maybe we could, like, tweak that ever so slightly and that would have a huge impact, maybe. Interesting, but I don't think this is gonna help us much, because it doesn't record the arguments. So all we know is a bunch of time is being spent in this part of rustc, without knowing why it's spent there. But so you sort of see my original point here, which is all of these metrics are kind of noisy, kind of hard to know what to act on. And you end up basically having to sort of read the code and look at, okay, what are patterns that this code is doing that we may be able to do better for? You know, same thing for toml_edit: if we wanted to do the same kind of analysis for it, we would end up in a similar position where we would have to figure out where it is spending its time. Now that said, you know, toml_edit is generating a lot of IR, which is weird, because toml_edit isn't, it's not instantiated for any particular type.
Maybe it's the IR for all of the standard library types that ends up being large. That could very well be, which is not very satisfying. I was hoping we'd be able to do more than this. Oh, I quit the process, didn't I? That's too bad. I was hoping we'd be able to actually, I mean, we did cut it a little bit, right? So we did discover that the Deserialize implementations for these types do generate a lot of IR. And it might be that rather than deriving Deserialize for these very large types, maybe we can write more efficient manual Deserialize impls for them. Or maybe this is an indication that serde_derive, which is the thing that generates that code, maybe serde_derive could be made to generate code that basically produces less IR in and of itself. And that would be a benefit for anything that derives Serialize and Deserialize. I also think it would probably be worthwhile to do the same kind of analysis for toml_edit. It could very well be that it too uses serde_derive. In fact, we should be able to see that here. In fact, it uses, yeah, it needs serde, and serde needs serde_derive. I'm guessing that toml_edit is the place to focus this attention. And in particular, I'm guessing that toml_edit could be made to have non-generic inner functions that eliminated some of this duplication across the different monomorphizations, which would then lead to improvements to cargo. It would lead to improvements to toml_edit itself. And maybe we could even have serde_derive generate non-generic inner functions for anything that derives, and thereby get this compile-time benefit for anything that uses it. Can you profile the code by running the tests? So profiling the code wouldn't help us at all here, because we don't actually care about the runtime of the code. What we care about is the time that the compiler spends building the code, which is just, like, unrelated to how long the code takes to run. You can write very short code that generates basically no IR but takes a very long time to run, right? The trivial example being an infinite loop. Takes forever to run, but it compiles real fast. Miniserde might be another way to go about this. So miniserde, if you haven't seen it, is, let me dig it up here. Miniserde, so miniserde is sort of an attempt at solving a very particular subset of use cases for serde in a way that's better for that subset of use cases. It's not a general purpose substitute for serde. As the readme says, it's more of a proof of concept, and it specifically works well for, well, first of all, it's designed to be really fast, but also for data where you don't want monomorphization or you don't care about it. And serialization is weird here, right? So I mentioned how you might want non-generic inner functions, and that is true for the purposes of, you want to avoid monomorphization, and so you want that shared code so that it compiles faster. But the flip side of that coin is, if you don't generate specialized implementations for each type, that means you're giving the compiler less of an ability to specialize the optimizations to the particular interaction. So for example, let's say you're gonna serialize into something, and your serialization code is generic over anything that implements Write. Well, writing to a file and writing to a vector of u8s are very different.
It might totally be that you can generate much more efficient code when you know you're serializing into a Vec<u8>, but if you have a non-generic inner function that just, like, always serializes in the same way to, like, an intermediate buffer or something, you would lose out on some of those optimizations. Or conversely, you would end up always having to serialize into a Vec<u8> even if the goal is to serialize into a file, because you're trying to keep it non-generic where you can. And so as you see, there are a bunch of caveats, basically, for cases where you would want to use miniserde, or rather where you would not want to use it. So for example, it's only really built for one format, which is JSON. And so there's an interesting observation here of, maybe toml_edit relying on serde_derive and the general serde framework here is actually coming at a pretty high cost, especially for something like cargo. That isn't to say, you know, that miniserde is the solution here necessarily, but more that, looking at a graph like this and looking at where it seems to be spending a bunch of its time, this might be a pretty ripe area for optimization. And crucially, it might have a serious impact on the build time of cargo itself. I want to talk about one more thing at the end here. I know we didn't get to actually make cargo that much faster, but the goal here was more the journey than the destination, right? To give you some idea of what tools you have available and what tools you don't have available, and the kind of techniques that you can use. So one more thing I wanted to talk about is, I did not open a tab for this apparently, is watt. And watt is not necessarily something you want to use, but it's an interesting idea to be aware of. So the observation with watt is that procedural macros are really powerful, right? So procedural macros, to do a very, very quick recap, are Rust programs that take Rust source code as input and produce Rust source code as output. They're basically fancy versions of macros. They get to be much more powerful in what they get to do. The challenge with procedural macros is in part that anyone who wants to use a procedural macro has to compile the procedural macro and then run it and then compile the output, which means that if we look at this sort of step diagram of the build, serde_derive basically produces a program that parses Rust programs, which means that it has to be built to completion, the final binary has to be linked, and it has to be done before you can even start looking at the source code of these downstream things. You can almost think of it as sort of, it ends up being almost like a build script that has to run. And what watt is trying to do is, rather than compile the procedural macro in every consumer of that crate, you compile it when the procedural macro crate is built, and then you stick basically the compiled output, the compiled program, into the crate. So the idea being that here, basically when we build serde_derive, we build the procedural macro as a binary, and then any consumer of it doesn't need to, I'm trying to find the right way to frame this explanation. Actually, let's see if the watt readme has a better way of phrasing this than I do. Yeah, so the idea here being that when I, in fact, does this happen at publish time or does this happen at build time? We save all downstream users of the macro from having to compile the macro logic or its dependencies themselves.
Right, so I think the idea of watt is that you actually do this compilation when you publish your crate. You compile all of the dependencies, you compile the program, the actual macro, you compile it to a WebAssembly binary. And the reason you compile it to WebAssembly is you can run WebAssembly on lots of different platforms without having to recompile it for each platform. So you compile it into Wasm, and then what you actually publish is that compiled Wasm code and a very tiny procedural macro that has basically no dependencies, that just takes as input the source code, runs it through Wasm, and then outputs the modified source code. And what that means is, if you look at something like the build graph for cargo, you wouldn't need to build serde_derive, or rather, the build of serde_derive is just a copy of the WebAssembly, and then running serde_derive would just run that WebAssembly. But you wouldn't need to compile any of serde_derive's dependencies, because it wouldn't have any. It wouldn't have a dependency on something like a Rust source code parser like syn, and quote. Instead, it would be a dependency-free crate whose only job is to run the WebAssembly binary. And that way, you know, you're gonna save a bunch of that compile time, because you don't need to care about the dependencies of producing the WebAssembly, because that happened on the developer's laptop when they published the crate instead. Does that distinction make sense? Now, what's nice about watt is that you can opt into it for your procedural macro on behalf of all of your downstream users. The downside is that it's not a super well-supported use case. Like, the tooling isn't really there for having a publish-time dependency. This is not something that cargo knows a lot about. And so it's a little clunky to get it up and running. It's a little clunky to be the publisher. But as a consumer, you don't really notice, because all you're gonna notice is that your procedural macros compile instantly and don't take dependencies. They might run a little slower, but they're gonna compile much faster. And there are some questions like, okay, why is this safe? Why is this okay? The answer to that is partially that WebAssembly makes it safe, right? So when you run WebAssembly, you can sandbox it pretty easily. And so you just run the WebAssembly, because all it gets is, like, a string as input, and it produces a string as output. It doesn't need to have access to IO or anything like that. So you can completely isolate that procedural macro when it executes. And if you look at the readme here, you can see the sort of instructions for how you go about it. And as you see here, the procedural macro that you write is really just gonna include the bytes of the Wasm file directly into itself. And the execution is: run that WebAssembly, provide it this string. So it's a really cool way to try to reduce the procedural macro side of things. And in terms of bringing this back to compile time for something like cargo, the reason this would matter is, you see here that serde can't be compiled, or rather, toml_edit can't be compiled until serde is compiled. Serde can't be compiled until serde_derive is compiled. Compiling serde_derive takes five seconds. So if you cut that to nothing, you've already shifted all of the serde stuff left by five seconds, or to the next max of some parallel dependency. But serde_derive itself couldn't start until syn was compiled. And syn can't be compiled until quote is compiled.
And quote couldn't be compiled until proc-macro2 was compiled. And proc-macro2 couldn't be compiled until proc-macro2's build script was compiled, which means that there's this whole dependency chain that we need to wait for before we can even build serde, and before we can build toml_edit. And if all those dependencies were handled at publish time instead, those would all go away. Those lines will move all the way to the left. And so we save a bunch of this basically wait time that is enforced by having to build the procedural macros in the consumer. Of course, that doesn't mean that you end up saving five seconds plus seven seconds plus a second or so. You don't end up saving actually all that time. You save some subset of it, because, as we talked about, you still need to wait for all your other dependencies to build. So serde has dependencies on other things than serde_derive. So you'd still be subject to their sort of waterfall build time. But you should be able to shift things left in general. Okay. I think that's all I wanted to touch on today, which I know is a slightly unsatisfactory ending, in that we didn't actually improve the compile time of cargo. But hopefully this has given you an insight into how you would go about improving the compile time of cargo. And I think realistically what's gonna happen is I'm gonna go do this investigation for toml_edit, see whether I can do some improvements there to help myself with this pain that I've been feeling. I don't know that I'll do that on stream, but at least now you know all the same things I do about how you would go about doing that work. Okay. Are there any questions before we end? Like, this was, I've been talking for a long time about all this stuff. And I'm sure people have questions about, how do I do this for myself? And while I'm waiting for questions, I highly recommend reading this article, like, start to finish. It goes through a lot of the tricks that we talked about today. And some of the other ones that, even though they're not gonna apply to other people using your crate, they are gonna apply to you for your builds. And that might be all that matters to you. I'll also link, in the description of the video, a couple of other articles that are useful that we've talked about today, some of the tools that we've talked about. There's also this article that I thought was really good about where rustc spends its time. I'll stick it in chat too. This is a much longer article and it might not be immediately reusable. It's also a little older, but it's a great read on basically how to figure out where all this time is going in a very low level way. You should comment a PR if you end up doing something. Yeah, if I end up filing a PR that helps with this stuff, I'll make sure to tweet it out. I think many people are gonna be interested in it. Do you lose any compile time to monomorphization if you only implement a single concrete type? No, I mean, I wouldn't think of monomorphization as losing time. Monomorphization is just making sure that you have compiled every instantiation of a generic type that ends up getting used. So again, Vec<u32> and Vec<bool>: monomorphization just means compile Vec<u32> and Vec<bool> if they're both used. But if all you're using is Vec<u32>, it will only compile Vec<u32>. There isn't really an overhead to the act of monomorphization. Whether you have more than one or just one, it's just a matter of how many times over is that chunk of code going to be compiled?
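A minimal sketch of that answer, with a made-up generic function:

```rust
// Each distinct instantiation that is actually used gets compiled once;
// instantiations you never use cost nothing.
fn first<T: Copy>(items: &[T]) -> T {
    items[0]
}

fn main() {
    let a = first(&[1u32, 2, 3]);  // generates first::<u32>
    let b = first(&[true, false]); // generates first::<bool>
    // first::<f64> is never generated, because it's never called.
    println!("{a} {b}");
}
```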
What do you think the story for build times in Rust will be like in two or three years? It's hard to say. I think, given the graphs that we looked at today too, a big win is probably gonna come from parallelizing some of these stages, but parallelizing them is really hard, because there's a lot of information exchange that has to happen. And so it kind of has to be fairly tightly coupled, but hopefully some of that is possible to strip apart. And I think procedural macros, with an endeavor like watt, it doesn't specifically have to be watt, but sort of doing more work at publish time, might be really nice and be a decent build time win. Incremental compilation is probably just gonna keep getting better. LLVM codegen gets better. There's also work on stuff like Cranelift, which is basically a rewrite of a code generation engine that is tied much more closely to Rust and is written in Rust. So hopefully it might be able to be more safely made concurrent, for example. And there's some hope that that's gonna reduce code generation time. Things like moving to Chalk, Polonius, like, being able to externalize some of these things might allow them to be optimized separately, which tends to make optimization a little easier. I don't know that I see amazing speed improvements coming out of nowhere. I think this is more of a gradual process of these things getting better over time. And there are some larger steps we could take, but they're also fairly complex. So I think it's gonna look a little bit more like a kind of log graph than it's gonna be a step function, but there are gonna be some steps in there. Now, I do think that we might see some decent wins from some substeps, but there are not a lot of magic beans here. Do you think this is a stream you would revisit if the kinds of tools you're describing were built, such as aggregating calls by their arguments? Quite possibly. I do think that stuff like this tends to take a fair amount of time. Like, building analytics tooling for development is just, it's a bunch of work. And it's a bunch of work that is sort of ungratifying to do until you're done. Like, sitting down and being like, I'm gonna write a new analysis tool for visualizing, you know, rustc build times, is probably gonna start out as a hobby project. And that means it's gonna take a long time to get to something that's, like, useful for people to use. But that said, you know, once those tools do come out, I would totally start using them, and that seems like a good opportunity to teach people about them. One thing I forgot to mention, actually, is one of the reasons cargo is a particular offender for build times is because cargo can't use cargo workspaces. So you can't currently split cargo into sub-crates, which is one of the best techniques for improving build times. You can't do that with cargo because cargo is a submodule of Rust. So it's already a member of the Rust workspace. And because you can't have nested workspaces in Rust, cargo is required to not be a workspace in and of itself, which is a real shame, because it means, you know, that cargo ends up being built as this giant blob and can't be broken up in a very meaningful sense. I do think that this is one of the reasons why the cargo maintainers really want nested workspaces to be a feature, because they want to use it themselves. It just turns out it's fairly complicated to get all the nooks and crannies and corner cases figured out.
But if anyone's looking for, like, a cargo project to pick up, and are not afraid to get into a pretty deep process, like, nested workspaces would be a huge win here. Is the release build time critical if the debug build time is fast? I think they sort of go hand in hand. Like, the biggest difference for release builds is more in the code generation step. There are a couple more optimization passes, I think in MIR as well, like, in the mid-level intermediate representation, but a lot of it comes down to LLVM doing a lot more work. So any improvements to rustc are probably going to be relatively independent. I do think it's important for release builds to be faster too, but I don't think it's as important as getting debug builds down. Do you think the compiler could be smarter and automatically implement inner functions? It could be that it already is kind of smart, but that it's sometimes complicated. Because let's take the one that we looked at earlier, with read_to_string that takes a path. Imagine you didn't have this inner function, and so you just wrote the function normally the way you would have, and you just stick path.as_ref(), like, in here. Then the compiler would have to, or, this is maybe a bad example, but one thing that will often happen is you're actually using the generic bit somewhere in the middle, or maybe multiple times. And really what you could do, like, as a human, is you could take the parts that are non-generic, the sections that are non-generic, and turn those into multiple non-generic inner functions, and then interleave them with the generic calls. I think it's gonna take a while for the compiler to get smart enough to do that. It would be really cool if it could; I don't know of a way for it to basically make functions for you. I put target dir on a RAM drive to avoid wear and tear, but the stream made me realize I should benchmark it to see if there are performance reasons to do it, or if caching cancels it out. I've found that sharing a target directory across all my builds is the best way to speed up my builds. Putting them on a RAM drive is nice, but, like, I have an NVMe drive, so I'm not concerned about the speed of the drive. The drive is very fast, but I am concerned about whether it has to recompile or not, and sharing a target dir helps with that. Do you think rust compile times are an issue with regards to rust adoption? I've heard this complaint several times from people who I'm pretty sure never used rust. Maybe. I think that you have to be kind of short-sighted to decide that the reason we're not gonna use rust is because it's slower to build than your language of choice. It could be among your reasons, and it could even be the reason why you start looking at alternatives, but I don't think it's a good reason in and of itself, partially because it's a problem that's gonna get better over time, and partially because, if you say that's the only reason, you're discounting the fact that there are other parts of rust, other features of rust, that save developer time, and so you kind of need to take that into account. You need to offset that in your estimation of how much time is actually being spent. It might be more annoying for developers to see slow builds even though they overall save time, because they do less debugging or something like that. So there's a hard-to-measure developer frustration aspect to this, but I don't think slow build times in and of themselves are a worthwhile sole reason to not choose rust.
I don't know when Cranelift is gonna be official to use for debug builds. Yeah, going through your favorite crates, or your most used crates, or even just the crates in your dependency graph that you take dependencies on, and looking for dependencies that can be made optional, and filing PRs with those projects saying, hey, I noticed you don't actually use this feature of this dependency, I got rid of it for you, here's a PR, can be really valuable, because that's one of the ways where we collectively as a community end up shortening that waterfall. What about build time for proc macros? I remember those being potentially pretty slow. So I think the reason why proc macros are often slow is more because they end up taking a dependency on syn, and syn itself is fairly involved. I don't know that we're gonna get rid of syn anytime soon. Syn is great for all sorts of reasons. I think the solution is more gonna come from something like publish-side compilation of proc macros, because that has some serious other benefits too, like being able to isolate proc macros, which is something that currently is a bit of an annoyance in Rust, or a problem in Rust, is that proc macros sort of get to do whatever they want. There's source-based code coverage that was really helpful to understand which code runs multiple times. I don't know that source-based code coverage is gonna help on rustc. Or rather, I don't know that it would tell me a lot more than what the perf output tells me, which is which functions end up being called a lot. That would help me maybe improve the efficiency of a given function, if I wanted to make a contribution to rustc, but it wouldn't tell me which parts of my code base, like, the thing I'm pointing rustc at, how can I improve that so that rustc has to do less work. Compile times are usually long when you have a lot of proc macros. Yeah, I mean, this is again one of the motivations for watt, which is that, as a user of proc macros, you won't have to compile the proc macros, which is gonna save you a bunch of that time. That said, you're still gonna have to execute the proc macros, and that can take a bunch of time too. So when we look at the build of cargo here, it could totally be that a bunch of this time is actually running the proc macros for, say, derive(Deserialize) and derive(Serialize). That could come from toml_edit too, quite possibly, and it's a little sad that this doesn't highlight time spent running proc macros, because that could be a significant contributor here. And watt certainly doesn't help with that, right? If I build my proc macros at publish time, you're still gonna have to run them, because something has to expand that derive(Deserialize) into the appropriate impls. In fact, if anything, it makes it slower, because I'm not gonna get a native binary, I'm gonna get a Wasm blob that's gonna run, and Wasm generally runs slower than a native binary. Are there benchmarks to compare build times between different programming languages? I haven't seen any. One of the reasons why it's difficult is because it's hard to get the same program if it's not trivial. And if it's trivial, it's not interesting. So it's tough to get a sort of apples-to-apples comparison here. Compiling rust binaries on a Raspberry Pi makes you feel like, yeah, I believe that. Do proc macros still build themselves in release mode, or the overall build is release even though they aren't part of the final code? I believe proc macros build in release mode when the overall build is in release mode.
I think there was a proposal to have proc macros always build in release, even in debug mode, because you only need to compile the proc macro once. So for incremental compilation, having the proc macro be a release build means that your incremental build is gonna be faster, because the proc macro will run faster. But I don't know where that actually landed. Does partial specialization improve overall times? I don't think it does. Are you worried about feature creep or the speed of adding features to rust? I've heard this from some people. No, I don't think I've really been, I think rust has been pretty good actually about preventing feature creep. I don't even see us adding that many features. If anything, rust has been adding features fairly conservatively. Now that said, I do think sometimes there's a stabilization rush, which is a little different from scope creep, right? So if you look at something like generic associated types or async await, when you feel like you're getting really close, it's tempting to say, it's good enough, let's just get it out, just in the final stretch, and that can sometimes be a mistake. But I don't think it's, like, a general problem. In fact, I think the libs team, the compiler team, the API team have been really good about making sure that they really think things through and figure out that this is the right decision overall and the right balance, because sometimes you won't know about the problems until something gets widely used, and it won't get widely used until you've stabilized it. So it's very easy with hindsight to say this was a bad decision or this was a good decision, but having the foresight to know that this is going to turn out to be good or bad is often really difficult. Let's see. Okay, I think that's the end of the questions too. We're getting up on the three hour mark, so I think that's a good time to stop. Hopefully that was useful. Hopefully I see a bunch of PRs to all these projects now, fixing up features for dependencies, making dependencies optional, removing dependencies, and maybe someone will even take on the work of trying to get toml_edit, for example, to either have fewer things that need proc macros to execute, or have more efficient implementations of Serialize and Deserialize, or something else that I didn't even think of. That would be really cool to see. Thanks for watching, and I will make sure to do another video on this if there's more stuff to follow up on. Next stream is probably going to be on either build scripts and FFI, or implementing a cargo subcommand, because I think it's super interesting, and it's slightly different, more of an impl Rust than a Crust of Rust, or something completely different if I change my mind. I'll see you next time.