for trustworthy binaries, helping detect vulnerabilities being injected in the build process. Rebooth will tell you about it. Please give him a round of applause. Thanks very much. Yeah, I would like to take the opportunity to tell you a little bit about reproducible builds, which I am a huge fan of. But before we get into what reproducible builds are, maybe let's first talk about who they are for. I would say reproducible builds are both for developers and for users, and I'm using these terms in the loosest way possible. So a developer is any person or group who builds something and then distributes a binary somewhere, and a user is anybody who gets a binary from somewhere and wants to run it. I will take the developer perspective first and return to the user perspective later. The goal, from the developer perspective, is that you want your users to use your binary, and you want the binary to do what you intended it to do. So you wrote some code, and you hope that the binary the user runs corresponds to the code that you wrote. Reproducible builds isn't a specific technology or tool; it's more of a technique, something you can apply to all kinds of different ecosystems, and we will look at a couple. Now, you might say: why is this not obvious? I'm a developer, I built the thing, I packaged it up. We know how to sign things and ship them over the internet, and users know how to check signatures. So what's the problem? The problem is the build step. This is the super-simplified version of the supply chain, and what reproducible builds help you with is checking that no foul play happened in the build step. The build step could be happening on the developer's machine, or it could be happening on some CI system. And as we all know, because we are at a hacker camp, machines can get hacked. If that machine is hacked, whoever has control of it could inject malware into the binary at that point.
And because the build step is where you sign the binary, you will sign it with the malware included. So signatures don't help against this. Of course, the pipeline in reality is not so simple. The inputs to the build are not only the code itself but also the libraries the code depends on, and the plugins you have for your build system, for example. In reproducible builds, we assume that all of that is fine: we assume the code has no malicious stuff in it, and we assume all the dependencies of the build are also OK. There's a bit of a chicken-and-egg thing here, or turtles all the way down: at some point you want to check that those libraries and build tools you're using are also malware-free. That's outside the scope of reproducible builds itself, but it's also something you want to do. So, does this actually happen, you might ask? It sounds like a really niche thing, hackers putting things into your build process. But I think this is an increasingly sensitive part of the software supply chain. In this example from a while ago, in 2018, someone hacked the Jenkins machines on which Homebrew was being built. In that case, once they were on the machine, they stole credentials. But you could very well imagine that once you have access to the Jenkins machine, it's relatively easy to make changes to the build process and inject malware there. And it's easy to imagine that those kinds of machines get hacked, because nobody really likes CI, right? Or actually, I really love CI, I love that it's there, but nobody likes managing and updating those systems and keeping them safe. It's not a fun part of the development process, and especially if you're in a team, you can definitely imagine things falling through the cracks there. So because we cannot be perfect, what do we do to mitigate the problem? That's where reproducible builds come in.
The core idea of reproducible builds is that instead of building once and shipping it, you build the same code twice, and the idea is that those two builds are as independent of each other as possible. Ideally they run on machines managed by different people, or at least with different credentials; they can even have a different operating system, depending on what kind of thing you're building. The second build doesn't need to ship the entire binary; it could instead ship an attestation that says: OK, I built this, and the hash of the resulting binary is such-and-such. Now, why would you do this? If you have done this, and at the end of the story, as a user, you check that what was shipped is exactly the same as the thing that came out of the other build process, then you can be a lot more confident that no malware was injected. Because if the two results are exactly the same, any malware would have to have been injected into both build processes. And because those build processes are completely independent, the chance of both of them being hacked is a lot smaller. Makes sense so far? That's actually also where the logo comes from: the top dot represents the code, which is a single unit; the two dots in the middle are the build steps, the two independent machines that built the code; and hopefully they arrive at exactly the same dot at the bottom. So that's pretty good. This is a little different from other uses of the term "reproducible builds". In the case of the Reproducible Builds project, we're really saying the end result is bit-by-bit the same. In some other contexts, "reproducible build" just means it was possible to build the code again, which is also very useful. But in the Reproducible Builds project, we end up with exactly the same thing.
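The check described above can be sketched in a few lines: hash the artifact a developer shipped and the artifact an independent builder produced, and accept only a bit-for-bit match. This is an illustrative sketch, not any project's official tooling; the file names are hypothetical.

```python
# Sketch: compare two independently built artifacts bit for bit by
# hashing them, as a second builder or a user would do. File names
# are hypothetical examples.
import hashlib


def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def builds_match(shipped: str, rebuilt: str) -> bool:
    """True only if the two builds are bit-for-bit identical."""
    return sha256_of(shipped) == sha256_of(rebuilt)
```

Note that an attestation only needs to publish the digest returned by `sha256_of`, not the binary itself; anyone holding both digests can run the comparison.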
And that means it is easy to check that it's likely no malware was introduced. One example is a project I've been working on full time for a while, the ACA project. This is a Java project, and in the Java ecosystem what you do is upload your jars to Maven Central. So there they are, but how do we know that we haven't shipped any malware in these jars? Because we did the work to make sure our builds are reproducible, after a release, usually on my laptop, I do the build again and see if I get exactly the same hash. It turns out that in practice this is definitely possible, and it's possible to do it on pretty diverse machines: we've had releases where the locally built artifact was made on Linux, and someone else reproduced exactly the same binary on a macOS machine. That gives us high confidence that nothing was injected in the build process. Then you might ask yourself: OK, but why wouldn't some software be reproducible? Compromised build infrastructure is, of course, one thing we protect against, but why is this not a trivial problem? Well, it's not trivial because in practice we don't have a habit of thinking about all the things that can be nondeterministic in a build process, and I'll give some examples. Sometimes it actually reveals some pretty subtle bugs; in other cases it's just random things that are nondeterministic, which are nice to iron out. An example of such a bug: there was a project which created a random seed for every build, because it wanted every user to have a different random-number progression, and it accidentally did that while building instead of while installing. That was a significant security problem, because it meant that everyone who got the same binary would get the same random numbers, while what you want is for everyone to get different random numbers, because you want randomness.
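The seed bug described above is easy to sketch. The names and layout here are hypothetical, purely to illustrate the difference between generating a secret at build time (wrong: the same value gets baked into every shipped copy, and the artifact also differs between builds, so the build is not reproducible) and at install or first-run time (right: every machine gets its own).

```python
# Hypothetical sketch of the seed bug described above.
import os
import secrets

# Wrong: evaluated once, while *building*. The same value ends up in
# every shipped copy, so all users share one "random" seed, and each
# build produces a different artifact, breaking reproducibility.
BUILD_TIME_SEED = secrets.token_hex(16)


# Right: generate the seed on the user's machine, at install or first
# run, so each installation gets its own value.
def ensure_install_seed(path: str = "seed.txt") -> str:
    if not os.path.exists(path):
        with open(path, "w") as f:
            f.write(secrets.token_hex(16))
    with open(path) as f:
        return f.read()
```

Applying reproducible builds surfaces exactly this class of bug: the changing `BUILD_TIME_SEED` makes two builds of the same code differ, which is the signal that something was computed at the wrong stage.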
So that's an example of a bug you could find by applying reproducible builds to your builds. Other common things you see are build timestamps. If you have a timestamp in your build that says "I built this at this and this time", then of course if you build it again somewhere else you get a different timestamp. A common way to solve this is, instead of using the build time, to use the time of the latest commit you're building, because that's actually what you want to know: where this code came from, not at what moment you built it. The convention for how to do that is the SOURCE_DATE_EPOCH environment variable: you put a reasonable but static or predictable date in there, and all your tooling picks that up and uses that date instead of the current time on the clock. Another common thing we see is that file ordering is not consistent. That's usually easy to solve by just sorting the list. And there are all kinds of other things: hash implementations that differ based on the machine you're on, binaries that differ because you're in a different locale or a different time zone, et cetera, et cetera. There are also additional advantages to having your builds reproducible. If you have the nice property that the same code always results in the same binary, in some cases this also makes caching much more efficient. There are build systems, such as, I think, Bazel, that will see that some code changed but leads to the same binary, and will then not rebuild anything that depends on it. If your build were not reproducible, you would get a cascade of actually unnecessary builds; but if your code builds reproducibly, that whole subtree does not need to be rebuilt every time, which is very nice. Okay, so say you set yourself the goal: I want my build to be reproducible.
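Those two fixes, honoring SOURCE_DATE_EPOCH and sorting file lists, can be sketched together in one small archiving step. SOURCE_DATE_EPOCH is the real convention; the zip packaging step and file names here are just an illustrative example.

```python
# Sketch: a deterministic zip step that honors SOURCE_DATE_EPOCH and
# sorts its inputs. The archive contents are hypothetical.
import os
import time
import zipfile


def build_time() -> int:
    # Use SOURCE_DATE_EPOCH (e.g. derived from the latest commit) when
    # set; fall back to the wall clock only when it isn't, in which
    # case the build will not reproduce.
    return int(os.environ.get("SOURCE_DATE_EPOCH", time.time()))


def deterministic_zip(out_path: str, files: list[str]) -> None:
    stamp = time.gmtime(build_time())[:6]  # zipfile wants a 6-tuple
    with zipfile.ZipFile(out_path, "w") as z:
        # Fixed ordering, no matter how the file list was produced.
        for name in sorted(files):
            info = zipfile.ZipInfo(name, date_time=stamp)
            with open(name, "rb") as f:
                z.writestr(info, f.read())
```

With SOURCE_DATE_EPOCH pinned, two runs of `deterministic_zip` on different days, with the inputs listed in any order, produce byte-identical archives.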
You build the same code twice and you get a different binary. What do you do? A very useful tool in this case is diffoscope, which is actually a project that came out of the Reproducible Builds project. You give it two binaries and it will show you the differences between them, like a diff, but it actually knows about a huge number of formats. So if you're comparing two zips, it will not say "these bytes are different from those bytes"; it will give you a useful difference, like: these timestamps, or these orderings, or these files inside the zip differ. That is very helpful in quickly determining what is different between the two binaries. Another thing you can use is a buildinfo file. That's a convention in the reproducible-builds world: you not only produce the thing you're building, but also a separate file in which you record a lot of information about the system on which you built. There you would typically also include information that shouldn't impact the binary, but might in pathological cases. Then if you see two different binaries in the wild that you didn't expect to be different, you can look at the buildinfo metadata and spot a pattern, say: it always looks like this on macOS and like that when built on Linux. Then it's easier to know where to look. I think these are the main reasons you would care about reproducible builds as a developer: you get more confidence that what you're sending your users is actually what you intended to send them, which is nice. But it's of course also useful to the users, because they want binaries without malware. And we can go a little further than that, actually: we can identify two types of users. There are users of closed-source software and there are users of free and open-source software.
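A minimal version of such a buildinfo record can be sketched as below. The field names are our own invention for illustration, not the Debian `.buildinfo` format; the point is just to record, next to the artifact's hash, environment details that shouldn't matter but sometimes do.

```python
# Sketch of a minimal buildinfo-style record (field names are
# hypothetical, not the Debian .buildinfo format).
import hashlib
import json
import platform
import time


def write_buildinfo(artifact_path: str, out_path: str) -> dict:
    with open(artifact_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    info = {
        "artifact": artifact_path,
        "sha256": digest,
        # Things that *shouldn't* influence the binary, but might:
        "os": platform.platform(),
        "python": platform.python_version(),
        "timezone": time.tzname,
    }
    with open(out_path, "w") as f:
        json.dump(info, f, indent=2, default=str)
    return info
```

Collect these records from enough builders and the pattern-spotting described above becomes a query: group the differing hashes by the recorded OS, toolchain version, or time zone.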
For users of closed-source software, there's really no way to verify for yourself that your vendor has used reproducible builds. The best you can do is ask them if they do, and I think this is going to get much more popular in the future. If you look, for example, at the SLSA guidelines that a group of companies, I think including Google, is setting up, they make recommendations on how to set up your pipeline, with four levels of how mature you are. And if you want to be SLSA level four, then you need to do reproducible builds, or you have to have a really good reason not to. So at some point you might ask your vendor: can you promise me that the maturity of your software supply chain is at least SLSA level four? To be able to say yes, they would have to at least look into reproducible builds, which is nice. But I think where reproducible builds really shine is on the free and open-source software side, because there it's really a sort of superpower. For the closed-source case, it's just the developer who builds the same thing twice and checks whether it's the same. But in the case of open-source software, you can also build the same thing as a user, or as a user community. And that means that if you have audited the source code, you can now also independently verify that the binary actually corresponds to that source code. I think that is a huge deal, because it reduces the attack surface by an enormous amount. This rules out, for example, blackmailing contributors, or contributors who have been away from the project for a long time and had their credentials stolen, or coercion, all these kinds of attacks. Suddenly you don't need to worry about them at all anymore, because you can independently verify that the binary actually corresponds to the code. So I think that is a huge deal.
Now we come to the point in the presentation where I have to tell you I lied a bit. It's not just developers and users who care. There's actually another big group who care a lot about reproducible builds, and that's distributions: Linux distributions and others. These typically sit between developers and users; they often build the software on behalf of the developers, for the users. And that puts distributions in the ideal spot to leverage reproducible builds to verify that the end result is actually correct. Distributions also typically have a lot of infrastructure and a lot of contributors, so it's kind of the perfect use case: properly securing all that infrastructure is super hard, so having the extra assurance that reproducible builds give you is extremely interesting for distributions. A lot of distributions are really active in this. Debian has traditionally been a huge driving force of the Reproducible Builds project in general. They have a bunch of packages that are already reproducible, and there's good work going on in making the live images, the ISOs, reproducible too. One recurring theme is that what's often missing in reproducible-builds ecosystems is an easy way for users to do that verification, an easy way for users to consume attestations by other users who have successfully reproduced the project. That is in many cases still open research. Debian has a sort of experimental plugin for APT that can check for reproducibility attestations, but this is definitely not something that is in common use or practically usable right now. Arch has a bunch more tooling in this respect, I think, but like Debian, they're not really there yet; some core packages still need work. openSUSE is great at upstreaming work, so they're also very active.
NixOS is a Linux distribution that I'm personally a huge fan of. What I especially like about it is that you get very stable dependency trees, which makes it relatively easy to achieve reproducible builds, because you know exactly which versions of your dependencies you will get. It's much less likely that you will get differences in your build because you happened to build with a different version of a dependency; that's just the way Nix is set up. The inputs are always very consistent, and that makes it a lot easier to make the outputs consistent. Nix has some tooling built in to check reproducibility: with a single command, `nix-build --check`, you can check that the binary in the binary store actually corresponds to something you build locally. Almost all of the installer is reproducible, but in NixOS too, actually consuming attestations by other users is definitely still a research project. One very interesting development here is Trustix: it tries to be a sort of proxy, into which you can inject rules, that can verify the existence of attestations. It's not something you can use right now, but it's super promising, I think. Guix is also very active. It's similar to Nix in the sense that you can be certain that your dependencies are consistent, and it has pretty interesting tooling: `guix challenge` is a command with which you can, in one go, check the reproducibility of basically a whole subtree of packages, which is cool. They are also a forerunner in bootstrappable builds. Bootstrappable Builds is a sort of sister project to Reproducible Builds, and it deals with the fact that, aside from the code, we also need to trust the compiler and the dependencies and so on. Bootstrappable builds tries to make sure that doesn't depend too much on binary blobs.
So bootstrappable builds tries to bootstrap your environment from source as much as possible, without relying on binaries, which is very interesting to look at. F-Droid, the popular Android package store, also does a lot of reproducible-builds work, but is currently not really surfacing that in the UI. I think they're definitely interested in doing that, but there are only so many hours in the day, of course. The Tails ISO is reproducible, but it has to be checked manually. So that's a bit of a whirlwind tour of where different projects are with reproducibility. If you ask: OK, what's next? What are the most important things to work on for reproducible builds right now? For users: for closed-source things, definitely ask your vendors whether they're using techniques like this in their supply chain. For open-source stuff, see if your favorite packages are reproducible, and try to reproduce them if there are any instructions. If there's nothing there, maybe just try it: build it twice, run diffoscope on it, and see if you can make sense of where the differences come from. There are a lot of tough nuts to crack, but there's also a lot of low-hanging fruit, so if you find something easy, starting with reproducible builds could be a really nice way to start contributing to a package that you use a lot. As a developer, of course, try to reproduce your own projects. I want to shout out the repository "I probably didn't backdoor this". It takes you through a Rust project and shows you how you would reproducibly create a binary, a Docker image, a package, and shows you in practice what it looks like, so you get a more solid idea of it. Aside from reproducing your own builds, checking your upstreams is, I think, an area where a lot of interesting problems are still to be solved.
This very much depends on what kind of projects you have and what kind of build tools you use. Ideally, you would check that all the libraries you depend on can be reproduced, but how would you do that? On the other side of this coin: empower your downstreams. Make it easy for the users of your projects to test for themselves that they arrive at the same binary. And if you're using a distribution, then definitely see if you can help your distribution become more reproducible. So there's a ton of super interesting work to do, I think. reproducible-builds.org is sort of a central hub where a lot of this work comes together, but because this is just a technique and not a technology, a lot of the work actually happens in the different ecosystems: in different Linux distros, in different build-tool ecosystems. So definitely have a look there. I will also make sure that the slides and the video are uploaded there, and with that I would like to open up for questions. Oh, yeah. Thank you very much. If there are any questions, please line up at the microphones in the middle, and please go close to the microphone so that we can hear you properly. Microphone, please. Just try again. Test, test, test, test. Yes, there you are, that's better. Okay, yeah, thanks for that talk. You mentioned that the attack surface has been reduced significantly because now we cannot pressure developers into shipping malicious builds. However, they can still be forced to check in, let's say, the second line of goto fail, right? Right, so you still have to audit the source code. In the pipeline we're trusting the source code, so to be able to trust it you will have to audit it and see that there's no malware there. Assuming you have done that, the attack surface is reduced in the sense that the developer cannot have injected something later in the pipeline. All right, yes. I have a second question. Yes, you can. Someone else? Yeah, go on.
Yeah, sorry, I don't want to be too pedantic, but I saw at some point that you provided checksums, or that you computed checksums of the binaries. What would be the reason to use MD5 or SHA-1? Ah, in the ACA example. Right, because that's what people do in the Maven ecosystem. Let me go back to that slide again. Yeah, there it was, I think. Exactly. So basically that's not really part of the reproducible-builds story. I think these are mainly there for legacy reasons, to check that the download didn't fail and things like that, not for security reasons. If you create an attestation, so you ship some code and you also create an attestation that says "I built this, and the hash of the binary is this and this", then you would use a stronger hash, like SHA-512 or something like that. Definitely agree, yeah. Okay, good. Thanks. Cool. And one last question. Yes, please. So there are techniques to content-address data; is reproducible-builds.org interested in using that technology? I know Nix has the Nix archive and wants to become content-addressed, and IPFS exists and has a content-addressed archive format, CAR. Does reproducible-builds.org find interest in this, and how is it going to use it in the future? I think those two fit really well together, because the more reproducibly built your project is, the more likely it is that the next build will have the same hash, and so the more likely it is that you already have it and don't have to fetch it again. I don't think reproducible-builds.org specifically wants to do something with it, but I definitely think they play to each other's strengths. Yeah. Okay, then thank you very much for the interesting talk, and please give a round of applause. Thank you.