So, we are going to start. This is our talk of the day. Joseph is going to introduce himself first and then his tool, Rustpräzi. A warm applause for him. Thank you very much. All right. So, today I'm going to talk about how to actually build a gigantic global call graph of crates.io. My name is Joseph Hejderup, and this is work that I've done together with Georgios Gousios and Moritz Beller, who are also sitting here in the audience. You can identify them by the shirt that I'm wearing. So, you might be wondering what Rustpräzi actually stands for. It comes from the German word "Präzision", which in English is "precision". I'm not German myself, but Moritz is, so he helped me give this project a wonderful name. To say a little bit about myself: when I'm not here at FOSDEM, I'm a PhD student at TU Delft, a technical university in the Netherlands, in a small, cute town. What I do on a day-to-day basis is work on dependency management problems. One of those projects is of course Rustpräzi, but I'm also into understanding practices and ways we can make dependency management more pleasant and better for developers. Before I was a PhD student, I worked on pull request prioritization as a lead developer at a startup. So, before we dive into the idea of building a gigantic call graph, I thought we could revisit how a dependency checker works. Just a question: how many of you use a dependency checker in general? Any hands? Okay, quite a few. But I'll give a short introduction to how it works anyway. So, over here I have a Cargo.toml file with some dependencies specified. You can see that some of them have a version range, indicated by the tilde or caret symbol.
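A Cargo.toml with such version ranges might look like this (the crate names and versions here are my own illustration, not the ones on the slide):

```toml
# Hypothetical manifest for illustration — not the slide's example.
[package]
name = "my-app"
version = "0.1.0"

[dependencies]
serde = "^1.0"   # caret: any compatible 1.x release
log = "~0.4.5"   # tilde: any 0.4.z with z >= 5
rand = "0.6.1"   # a bare version is treated like a caret range by Cargo
```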
When we want to do some kind of security checking, or find out whether a dependency is vulnerable or has a license conflict, we first resolve the versions, and after that we build the dependency tree. This gives us the top-level dependencies, but there are of course more dependencies that the packages specified in the Cargo.toml file themselves depend on. I didn't go further here, but this is generally what a dependency tree looks like. In academia, we usually build dependency networks to understand how package repositories behave, and maybe also to understand problems like the left-pad incident. So over here I have three packages: package A, B, and C. What we do is join them based on the same package version name. This way you can ask, for instance: of those who depend on the left-pad package, how many would be affected if it were removed? And, as I said, we merge those into a single network. Then we have call graphs. Here I have some sample code, probably Rust pseudocode rather than exact Rust, that receives notifications but doesn't actually do anything with them. How do we build a call graph for this code? The first thing we do is identify all the function calls and all the function definitions, so we have the main function and also is_ready. From this we look at who calls whom, and then we get a call graph like this. In Rust we can actually do this using, for instance, the LLVM IR code, which was mentioned in the previous talk; you can use the LLVM opt tool to generate the call graph. But one problem is that when you analyze a single program, you only get a partial picture.
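As a sketch of the idea, here is a tiny Rust program in the spirit of the slide's pseudocode (the names are mine, not the slide's), together with the call edges one would extract from it:

```rust
// A minimal program: main calls is_ready, which pretends to check for
// a notification and does nothing with its contents.
fn is_ready() -> bool {
    // Pretend a notification arrived; we ignore it.
    true
}

fn main() {
    if is_ready() {
        println!("ready");
    }
    // Extracted call graph edges: main -> is_ready, main -> println.
}
```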
These days we actually go beyond a single program, into the dependencies, to get the full picture of how all functions are called. Here, of course, we have the example with the connection function in our own app, but we can also look at which other functions call that connection function. So we see, for instance, one crate calling it from something like a request function, and another from something like a network-link function. The second part we do these days is that we not only capture the functions, but we also annotate them with their versions. By doing so, we can merge two concepts together, the dependency network and the call graphs, and together we get something called a call-based dependency network. Now, you might be wondering why we should do all of this in the first place. Well, as library maintainers, or if you run something like crates.io, you might be interested in knowing what's happening in the community. For instance — and this doesn't really exist yet, it's just a vision — you might install Präzi and then want to publish a new version. But when you run it, you get a failure. Why did you get a failure? You removed a deprecated function, and when we look at it, publishing this would affect 15% of crates.io. A good idea here could be that instead of making this a problem for those who depend on your library, the release can be put on hold, and maybe some further analysis can be done on it. As an example, maybe the threshold is that you're allowed to break 2% of crates.io. Beyond this, you can also analyze adoption. So here I had two versions.
You can see, for instance, that when you release your new version, looking back one week from today, 5% fewer users are on your version 0.5.0 and 12% more are on 0.1.8. And there are many more applications: for instance, health aspects, maybe also licensing, security, and maybe we can learn call patterns in general, which can serve education or other purposes. So the next question is: okay, this maybe looks really cool, but how do we actually turn crates.io into a gigantic call graph? It might sound simple on paper, but it's much more difficult than that. The first thing: how do you actually compile 22,000 packages? Because from the Cargo.toml file you don't really know which compiler version is compatible, but also which architecture, because some packages are for Windows, some for OS X, some for embedded devices, so it's not at all trivial. Then there's the question of what the entry point is, what represents a package: should it be the library component, should it also be the binary, et cetera? So it's not that simple. The other aspect is version resolution. As you saw with the version ranges, they are time-dependent: if you resolve a package today and you do it again one week later, the result might be completely different. This also affects what the complete graph looks like in the end. And then there are the call graphs themselves. Call graphs are an approximation of how a program calls functions, and there are two important aspects. One is precision: you want the call graph you derive to be exact and complete. The other has to do with soundness, because there are features, for instance dynamic dispatch, that are not easy to handle. To give an approximation, you can, for instance, add unlikely calls.
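Dynamic dispatch is the classic case of such over-approximation. A minimal sketch (the trait and types are my own example, not the talk's): a static analysis cannot know which implementation a trait object resolves to, so a conservative call graph adds an edge to every implementation.

```rust
// Two implementations of the same trait method.
trait Notifier {
    fn notify(&self) -> &'static str;
}

struct Email;
struct Sms;

impl Notifier for Email {
    fn notify(&self) -> &'static str { "email" }
}

impl Notifier for Sms {
    fn notify(&self) -> &'static str { "sms" }
}

fn send(n: &dyn Notifier) -> &'static str {
    // A conservative analysis adds edges send -> Email::notify AND
    // send -> Sms::notify, even though only one is taken at runtime.
    n.notify()
}

fn main() {
    println!("{}", send(&Email)); // prints "email"
}
```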
For instance, you might have multiple implementations of a call; you basically create edges to all of them, because you don't know exactly which one will be invoked at runtime. So, despite these aspects, I still went ahead and did this. To give an overview of the approach: the first thing I do is retrieve and build the packages. That means downloading the packages, and then also doing some cleaning of the Cargo.toml files, because some of the versions actually specify path dependencies, which don't exist in the downloaded crate. Then I validate them and build them. In the second step, I generate the call graph, and as I mentioned earlier, I did this using the LLVM call graph generator. I know from discussions on the Rust forums that there are different ways to do it, but this was the way I started initially. So first you get the call graph, with these very cryptic function identifiers; they're not really that cryptic, they're mangled Rust function identifiers. Then you demangle them; in this case I used the rustfilt tool. In the third step, I build the unique identifiers, which I showed in the earlier call graph with the versioning. Then, similar to the dependency network, I merge them together. Finally, once you have annotated all function identifiers with their unique names, you can merge everything, and then you have your call-based dependency network. So the first step is... oh, sorry. I'm going to talk about two main challenges here. One is, of course, the compiling, and the other one is the function identifiers. There are, of course, many other challenges, but I won't have time for those. So I did this back on the 16th of February 2018, almost one year ago now. When I first attempted to build everything, I got quite a few errors, which is of course expected and didn't make me so happy.
One of them was, for instance, that Cargo couldn't load a source dependency. Another was that a crate required nightly features. And then there were also cases with a custom build command that I could not execute. I was able to mitigate some of them. For the first case shown over here, I basically used a rewrite step in Cargo that takes the published version of the Cargo.toml file and not the one that you download from the API. Because sometimes they have path dependencies; they shouldn't really be there, but that was an issue. Then, for nightly features, what I did was run a couple of nightly compilers to get things working, but the problem is that you don't always know exactly which nightly version is the right one for a crate. And lastly, the build scripts: one way we mitigated those was by installing a lot of system packages. I learned recently that there's actually a Dockerfile that has all those system dependencies; I haven't tried it myself, but there is a solution to it. One important thing here, which I would like to see in Cargo, is that you can actually validate that a package compiles, and also have metadata about which compiler version and which environment a crate needs, so that this can be taken into account. When I skipped those, some other errors remained: for instance, many of the packages actually used a trait incorrectly, and there were a lot of errors of this type, so I found quite a few of them. So what are the final compilation statistics? After removing the invalid manifests, I had in total 12,000 packages, which amounted to 72,000 releases of those packages in total.
Out of those, I managed to build 49,804 call graphs, covering in total 11,000 packages, or crates. It took 69 hours to do it; of course, I didn't do it on a MacBook, I actually used a cluster at the university. In total, I could build, at that point in time, 70% of crates.io. So this shows that Cargo and crates.io actually give some really good guarantees for building packages. All right, then the other part is the Rust symbols. I wanted to annotate them with version numbers, but the problem here is that I cannot use a simple regex to do it, unfortunately, so I had to actually build a parser on top of them. You can see, for instance, that this symbol has semver as its core, and here is basically the semver Version type implementing the partial ordering trait. To solve this, I built on top of syn, which actually parses Rust code, and adapted it to these specific Rust LLVM symbols. By doing this, I was able to annotate version numbers. To append the version numbers, I basically look at the Cargo.lock file of the package after building it, extract this information, and then append it. So I add first the ecosystem, then the library version, the module, and the function name. This is the way I create a unique function identifier. Now, I'm not going to do a live demo, but I'm going to show a little of what you can do with it. I built two applications with it. One is, of course, the popular one: security. And I also did one on deprecation. I think many of you have probably seen warnings like this, of course not with Rust code, but probably with JavaScript and, I think, probably Ruby gems. So I tried this using the RustSec database. At that point in time, they had six advisories, and from those advisories I could extract certain functions that were affected by some form of vulnerability.
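As a sketch of that naming step (the exact field order and separator are my paraphrase of the talk, not Präzi's actual scheme), building such a version-annotated identifier might look like:

```rust
// Hypothetical identifier scheme: ecosystem, package, version, module, function.
fn unique_id(ecosystem: &str, package: &str, version: &str,
             module: &str, function: &str) -> String {
    // Join the components into one globally unique function name,
    // so the same function in two releases gets two distinct nodes.
    format!("{}::{}::{}::{}::{}", ecosystem, package, version, module, function)
}

fn main() {
    let id = unique_id("cratesio", "semver", "0.9.0", "version", "partial_cmp");
    println!("{}", id);
}
```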
I did this using both a regular dependency checker, meaning I looked just at the package information, as I showed before, and then I used Rustpräzi to see the number of affected packages. Here we can clearly see the advantage of doing it at the call graph level, because we get more precision and avoid false positives. As you can see, the numbers are much lower, which means that I, as a developer, don't have to go through as many packages to see whether they are false positives. To check whether this result is actually accurate, and not itself just a lot of false positives, I looked only at the direct dependencies and analyzed those. I found that Rustpräzi is actually three times more accurate than the regular dependency-based approach. Something I didn't mention on the slide is that there are some problems with respect to the completeness of Rustpräzi, because I do not capture, for instance, dynamically dispatched functions, and there are also some problems with conditional compilation, et cetera. But in principle, by working at the call graph level, we can have higher precision when we do this type of analysis. A really cool thing is that I was, of course, posting this on the Rust forums; I'm not sure if you have seen it, but Tony from the RustSec community added a feature for listing affected functions. That's a very nice thing for me, because later on I can easily import data from the RustSec advisory database, get the functions, and fire a query on the Rustpräzi graph. The other case I looked at was deprecations. I did a very small study, looking at questions like: how many would be affected by the removal of a deprecated function? This was the main reason I wanted to do this. So I looked at functions, using my dataset of all packages.
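The difference between the two checks can be sketched like this (toy data and names, not Präzi's API): a dependency-level check flags any dependent of a vulnerable package, while a call-graph-level check flags only those that can actually reach the vulnerable function.

```rust
use std::collections::{HashMap, HashSet};

// Toy call graph: edges from caller to callees, keyed by function id.
fn reaches<'a>(graph: &HashMap<&'a str, Vec<&'a str>>,
               from: &'a str, target: &str) -> bool {
    let mut seen = HashSet::new();
    let mut stack = vec![from];
    while let Some(f) = stack.pop() {
        if f == target {
            return true;
        }
        if seen.insert(f) {
            if let Some(callees) = graph.get(f) {
                stack.extend(callees.iter().copied());
            }
        }
    }
    false
}

fn main() {
    let mut g = HashMap::new();
    g.insert("app::main", vec!["lib::safe_fn"]);
    g.insert("lib::safe_fn", vec![]);
    g.insert("lib::vulnerable_fn", vec![]);
    // The app depends on `lib`, so a dependency checker would flag it,
    // but it never calls the vulnerable function, so the call-graph
    // check reports no reachable vulnerability.
    println!("{}", reaches(&g, "app::main", "lib::vulnerable_fn")); // prints "false"
}
```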
I looked at how many of them actually had a deprecated function, using the deprecation annotation, and I didn't find too many: I found 11 releases across six packages. With these, looking at how they were actually used by other people, I found that in total 311 packages were using those deprecated functions, and those are not only top-level packages but also transitive ones. If I were to remove those functions, 52 of them would actually be affected. Of course, whether they should actually be removed or not is a different discussion, but this is one form of analysis you can do with Rustpräzi. And I really want this to be a community effort. The way I envision Rustpräzi is as a way to do analyses and make data-driven decisions: for instance, for the Crater project, which runs regressions over crates.io using the compiler; but also for the ecosystem working group, where they try to find crates that, for instance, have not been maintained for a long time, or that have other problems and need some attention; and also the security working group, as I showed on one slide. This is really what I want to go forward with. There are, of course, many open problems. This is a prototype, and I'm a researcher doing development, but I really hope to make this something that can benefit the wider community. As you've probably seen on the T-shirt and also on many posters, one way to help me make Rustpräzi better is to take my survey.
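The annotation in question is Rust's built-in deprecation attribute. A minimal sketch (the version and note are made up): callers still compile, but the compiler emits a warning, which is the signal the study looked for.

```rust
// Marking a function as deprecated; calling it still works,
// but every call site gets a compiler warning.
#[deprecated(since = "0.2.0", note = "use `new_api` instead")]
fn old_api() -> u32 {
    1
}

fn new_api() -> u32 {
    2
}

fn main() {
    // Silence the warning for this one call site.
    #[allow(deprecated)]
    let old = old_api();
    println!("{}", old + new_api()); // prints "3"
}
```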
So unfortunately, time is short, but I would really appreciate it if you could fill in the survey with problems that you have experienced with dependency management, or any other suggestions about dependency management in general. So thank you very much. I'm ready for questions. [Audience question.] So your question is about the number of false positives in general when using Rustpräzi, and also why I don't include control flow graphs. With respect to control flow graphs: because I wanted to build this for one whole ecosystem, the first level of granularity I wanted to look at is the call graph level. The idea is that, for instance, I can later slice part of the call graph, say the affected paths throughout the ecosystem, and do more fine-grained analysis by extracting the control flow graph for that part of the functions, so going one level down. With respect to false positives: because I'm using the LLVM call graph generator, it is precise, but it is not sound. It is missing dynamic function invocations, and it cannot handle, for instance, generic functions. This is a big problem with using it, and of course I'm trying to look for better call graph alternatives. I would really like some help here on how to do it better, and also to build something that is more complete. For instance, in Java you have Soot and WALA, which cover a lot of features; it would be very nice if there were something similar for Rust as well. Any more questions? Yeah, I guess that's it.