Yo, and welcome to this presentation about reproducible builds. I think this will take 20 minutes or something. We're going to talk a little bit about what reproducible builds are, what Arch has been doing in the past year in terms of reproducible builds, how it works, and how you can contribute. So I'm Morten Linderud, I also go by the nickname Foxboron on the internet. I work as a security engineer as my day job, and I've been contributing to the Arch Linux distribution since 2017. I'm a trusted user and I do a lot of package maintaining, such as the Go compiler, i3-gaps, DVM, and a lot of the container ecosystem, in terms of Podman, containerd and LXC. I also contribute to the security team, doing CVEs, security tracking and publishing advisories, and most importantly, I do reproducible builds. So we're going to work through how the supply chain in Arch Linux works, how we do package building, what reproducible builds are, and how it's organized. We'll take a look at the current tooling we have for users, which we have been working on the past year, then we'll talk a little bit about how rebuilding works, and lastly about how you can contribute. I'll also try to do some live demos, so we'll see how that goes. So the supply chain of Arch is essentially how we take the pristine upstream code at point A and deliver it to point B, which is the users. Generally, we track down the project repositories for the different projects. We do the integrity checks, which is the checksums, to make sure nothing is corrupted during download. And then we take the signatures to have authenticated releases: we know that someone did actually authenticate the release, and there was nothing in between doing that. Next up, there's the building part. Some people use build servers, some people use their local machines using the tooling we provide. And then lastly, there's the signature step, which some people do on the build server or on their laptop.
Some people have done everything on the laptop, and some people do GPG forwarding. Once the packages have been built, we upload them to Gemini, which is our tier zero mirror, and they're then distributed to the tier one and tier two mirrors, which then allow you to download the packages. So this is the path from point A to point B. It's not a singular step; there are a few different variables. But we generally build everything in clean chroots to have everything isolated from the main systems. That ensures we don't have dependencies which pollute the build, we don't have environment variables that affect the build, and so on. So in theory, nothing should affect the build. Nothing should change it, so to speak. So what are we trying to do? Currently, I have built Pacman, the last release we did. This is essentially the last release that Allan did in July, Pacman 5.2.2, and we have already built it once. So what we're going to do is just go ahead and build it a second time. In theory, it's all isolated from the rest of the system. There's nothing that should affect the build, right? There are no external dependencies and no environment variables. Let's take a look at how this goes. It's now running through the tests of Pacman. I've done this a few times now, so it shouldn't fail, which would be very hilarious. Now we're doing the packaging step. We take all the man pages and the locales and stuff, and we compress it into a package, and we have some sanity checking tools on top. So in theory, nothing should be affecting the build, nothing should come between, but still we see two different checksums. So this is a bit strange; what's the reason for this change? If you use diffoscope, which is essentially a tool created to check differences in tarballs, binary files and such, you quickly see that the man pages have embedded timestamps, which changes the MD5 and SHA256 digests.
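The timestamp issue that diffoscope surfaces here can be sketched in a few lines of Python. This is a stand-in for what the gzip-compressed man pages do, not the actual packaging code; the sample data is made up:

```python
import gzip
import hashlib

data = b"PACMAN(8)  manual page contents\n"

# Default behaviour: gzip embeds a modification time in its header,
# so compressing identical input at two different times yields
# different bytes and therefore different digests.
a = gzip.compress(data, mtime=1)
b = gzip.compress(data, mtime=2)
assert hashlib.sha256(a).hexdigest() != hashlib.sha256(b).hexdigest()

# Pinning the timestamp (what `gzip -n` or honoring SOURCE_DATE_EPOCH
# achieves) makes the output byte-for-byte reproducible.
c = gzip.compress(data, mtime=0)
d = gzip.compress(data, mtime=0)
assert c == d
print("same mtime -> identical archives")
```

The `mtime` keyword of `gzip.compress` needs Python 3.8 or newer; real build systems get the same effect by passing `-n` to the gzip binary.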
You also see that the directories inside the package have different times. So we're not completely isolated from the rest of the environment. So what reproducible builds are about is essentially ensuring that between the source code we take and the binary artifacts we provide, there's a verifiable path where nothing can go in and change things. This was started in Debian in 2014. It's currently headed by Holger Levsen, who contributes to Debian. And there are quite a lot of stakeholders these days: Debian, openSUSE, NixOS, Guix, OpenWrt, and Arch Linux of course, and a few other projects as well. The Reproducible Builds project publishes monthly reports, which detail the progress that's been made during the past month. On the Arch team we have anthraxx, kpcyrd, Allan, and many, many other people that contribute to the project on the Arch side of things. Generally, there are a few people that have been doing this for the past years. It started with anthraxx in 2016, I think, and later on there have been more and more contributors from Arch who have been invested in this. There have been several summits since 2015, held in different places; Arch has been represented in all of them, I think. There's a picture from the one that was held in Marrakesh in 2019. Sadly, there won't be any summits this year, but it's a productive thing where a bunch of people from different projects and different distributions discuss and try to hack on different issues within the reproducible builds space. So the main goal is, again, to trust the packages, but also to be able to verify the work we're doing. Not just blindly look at me and say, yeah, I trust that you packaged i3-gaps with this version using this source code, while you essentially have no way to verify it and just take my word for it. You should be able to check it yourself.
So the Reproducible Builds project defines SOURCE_DATE_EPOCH, which is essentially an environment variable that allows you to specify the timestamp of the build. That can trickle down into the build system and make sure the sort of thing we saw earlier doesn't happen, so that you always have the same timestamp on the directories that get created. We also have the .BUILDINFO file, which is sort of a bill of materials. It's supposed to denote all of the different things that go into the build. Here you see the BUILDINFO file from Pacman. It denotes the package name, package version, the package architecture, the checksum of the PKGBUILD, the build date, the build directory, and then all of the dependencies. This format is described in the BUILDINFO man page, and I really encourage you to go check it out. So to provide user tooling, you have to read this file to recreate the environment. We use the build date, the build directory, all the installed packages, and which compression was used. Most packages use zstd, but there are still some left that use xz. You also have to take the Packager environment variable. There are currently two tools that allow us to utilize the BUILDINFO file. There's archlinux-repro, which is a tool that's intended to be run not only on Arch, but also on other distributions. It's also supposed to abstract away a little bit of the nitty-gritty details of how to fetch the PKGBUILD files and how to retrieve the source files. And then you have makerepropkg, which is provided by devtools. It's more of a developer-friendly tool which does not abstract away that many details, but it's easier to use in a development setting. So we have two different package files. What we're going to do is run repro on the first package that we built, so it's going to be built a second time.
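As an aside, the BUILDINFO format described above is simple enough to parse by hand: one `key = value` pair per line, with some keys (like `installed`) repeated. Here's a rough sketch of such a parser; the field values in the sample are invented and the real file carries more keys than shown:

```python
def parse_buildinfo(text):
    """Parse BUILDINFO-style 'key = value' lines into a dict.
    Repeated keys (like 'installed') are collected into a list."""
    collected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" = ")
        collected.setdefault(key, []).append(value)
    # Single-valued keys become plain strings, repeated ones stay lists.
    return {k: v[0] if len(v) == 1 else v for k, v in collected.items()}

sample = """\
format = 2
pkgname = pacman
pkgver = 5.2.2-1
builddate = 1593874800
installed = acl-2.2.53-2-x86_64
installed = attr-2.4.48-2-x86_64
"""
info = parse_buildinfo(sample)
assert info["pkgname"] == "pacman"
assert len(info["installed"]) == 2
```

Tooling like repro reads out exactly these fields (`builddate`, `installed`, and so on) to recreate the build environment.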
So here we are reading the BUILDINFO file and fetching a bunch of packages from our package archive, matched to the installed values in the file. We'll do a little bit of a dance to ensure that the environment is consistent, which involves installing packages two times: essentially once with just the new packages, and then reinstalling all the packages to make sure we have a consistent environment. And then we're going to go ahead and build Pacman. So now we're soon done setting up the environment, and now we're downloading Pacman. In this run I'm not running the tests, because in Pacman they do not affect the build. They shouldn't really affect the build, but sometimes they do; that's not the case in Pacman, nor in most packages, really. So it shortens the build time. So now we're done packaging up all the man pages, and it's cleaning up after itself. It reports the package as reproducible, but we can just quickly compare them ourselves. Now we see that both of those packages have been reproduced. They're identical; there are no differences in the files. And that's the intention of the tooling. So we have been rebuilding packages for a while with the CI system hosted by reproducible-builds.org. It's been going for two or three years now. But it's called the CI because it doesn't actually take the distributed packages; it tries to find upstream bugs, not flaws in the distributed packages. It only checks out the source files and compiles them two times in different environment settings to flush out the bugs upstream. So it's more of a theoretical value than a practical one. What we did get this year, however, was rebuilderd from kpcyrd. It's sort of a distributed CI system, which uses repro and the archived packages, which are the ones we are actually distributing.
It currently has three nodes, which run on our infrastructure. So this isn't only checking out the source files; it's also testing the actual distributed packages. Comparing the two: the CI has core at 93% reproducible packages, extra at 85% and community at 77%. That's the graph which is published, and you see it sort of fell a bit in the summer; then we fixed it up recently and it's been ticking along nicely. The rebuilderd setup, however, has slightly better results overall. That's because we are not varying values which normally would not change at all. So core is at 94.4%, extra is at 90% and community at 78.6%. And this is actual reproducible packages today, which is nice. We're not at 100%, and it will take a while before we are, but we're getting very close. So that's fun. Now we'll take a look at how to actually reproduce packages. One of the important packages in Arch Linux is the keyring, which contains all of the package signing keys from Arch. So I can actually show this. This is the keyring I have installed; I'm fetching it from my package cache. We'll just go ahead and reproduce this package. So again, we're doing the same dance as last time: getting all the packages, recreating the environment, and then actually taking a distributed package and trying to reproduce it identically. It has a few less dependencies than Pacman, so it's not easier, but it's faster to reproduce than Pacman. Now we're doing the second dance again. Then we're heading over to the building part of repro, fetching the sources. Now we've built it, and it's cleaning up after itself. And you see it's reproducible. We can quickly check that right now: we have that package, and then we run sha256sum on it.
And you see that now we have the same checksum. So we have now actually reproduced an actual package that Arch distributes. So what's currently reproducible? We care a lot about core, because it contains the central packages and should have a high standard. And there are a few packages left: mostly, the Linux kernel is not reproducible because of different issues we'll get into. I think we're trying to reproduce GCC as I'm recording this talk. And there are still a few packages with different issues. The problem is that they're complicated, with legacy build systems that are hard to navigate. So there's quite a lot of effort left. It should be noted that the tooling you see here is still experimental. Things can change, and we can encounter new issues in Pacman, in repro, in the CI or in the rebuilder step. So if something turns unreproducible tomorrow, it doesn't necessarily mean that something bad has happened. We have issues with the private key generation in the Linux package, where we generate a set of keys to sign modules. These keys are uniquely created for each package build, which implies that the package, by default, is not reproducible. That's more of a packaging issue than an upstream issue; it's about how you separate them. You also have things like the path being embedded multiple times, depending on how you source the files. That was also how we discovered that most of the Haskell packages are not reproducible. This has been fixed recently, so if you've seen the embedded-path problem that some people have been having, that's the same issue. We also have the Python hash seed, which is essentially a problem where the bytecode generated by the Python compiler is not deterministic. It randomizes the keys in dictionaries and object tables and all of that stuff, and it makes packages unreproducible for us. In Debian, the bytecode files are separated out and then generated at installation.
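The hash seed behaviour is easy to observe from outside: this sketch spawns fresh interpreters to show that pinning PYTHONHASHSEED makes string hashes repeat across runs, while the default randomization gives a different value almost every time. The string being hashed is arbitrary:

```python
import os
import subprocess
import sys

def str_hash(seed):
    """Run a fresh interpreter and report hash('reproducible')
    under the given PYTHONHASHSEED setting."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('reproducible'))"],
        env=env, capture_output=True, text=True, check=True)
    return out.stdout.strip()

# With a pinned seed the hash values (and anything derived from them,
# like dictionary iteration order in compiled bytecode) repeat...
assert str_hash("0") == str_hash("0")

# ...while the default randomization changes between processes,
# so we just print these rather than asserting inequality.
print(str_hash("random"), str_hash("random"))
```

This is the same knob a build system would pin to make Python bytecode deterministic across rebuilds.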
But we bundle all of that together in the package. And there are still general packaging issues in Arch. But there are also a few issues which are not related to Arch. A common thing packagers do is to fetch patches, and one of the ways to get patches is to just find the upstream repository, find a commit hash, and add .patch at the end of the URL. That will give you a nicely formatted patch. The problem is, as we recently discovered, that on GitLab these patches are not reproducible. I'm not sure if you can actually see the issue in the patch, but the Git version is embedded in the patch there. That means that whenever GitLab, like the main GitLab servers, update their Git version, the patch will have a checksum that changes accordingly. So in one way it's a reproducibility issue, but it's also a problem where integrity checks fail over time. That was fixed by a few helpful people in the Arch Linux reproducible channel, and currently this is not an issue on the live servers. But it's still a demonstration of how there are not only packaging issues, but also upstream issues that can affect the build. So one way to contribute to reproducible builds is to find your favorite software and try to reproduce it. If it doesn't reproduce, try to figure out the flaws. reproducible-builds.org has a lot of documentation, comments and information that you can use to figure out how to solve these issues. You can also work your way through packages. So, I built this package earlier today, and what we're going to do is reproduce it. What this does is read the PKGBUILD file in the current directory and run diffoscope on the result, and then we take the package in the current directory. So what we should do now is just try to reproduce it with the files we have checked out.
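The GitLab patch problem mentioned above boils down to a version string embedded in the payload. A toy illustration, with invented patch text and version numbers:

```python
import hashlib

# Two 'identical' patches that differ only in the version trailer
# appended at the bottom; the actual diff content is the same.
body = """\
From abc123 Mon Sep 17 00:00:00 2001
Subject: [PATCH] fix build

--- a/Makefile
+++ b/Makefile
@@ -1 +1 @@
-old
+new
"""
old = body + "-- \n2.26.0\n"
new = body + "-- \n2.28.0\n"

# Same diff, different checksum: a PKGBUILD that pinned the old
# digest now fails integrity checks even though nothing meaningful
# about the patch changed.
assert hashlib.sha256(old.encode()).hexdigest() != \
       hashlib.sha256(new.encode()).hexdigest()
```

The takeaway is that any server-side version bump that touches generated artifacts silently breaks pinned checksums downstream.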
And then we're going to see if we get a package identical to the one we built earlier today. So again, we're setting up the environment, installing the packages twice. Now we're doing the build. What we're hopefully going to see is that the package is not reproducible, and the change we're actually going to see is that the gzip-compressed man page has a timestamp embedded into it, which changes the checksum of the man page and gives us a different result. Now, in this case the package has actually been made reproducible; it has been fixed. So I conveniently removed the 'make man page reproducible' patch for this demo. If you take a look at this patch, we see that it appends the -n switch to gzip, which makes the man page reproducible, and it also makes sure that the timestamp embedded by help2man adheres to SOURCE_DATE_EPOCH. With all of these changes, we'll now just create a package from scratch, which doesn't take a lot of time, and then we rerun the same command we used earlier. Now we also quickly see that there are some package updates. Now we're reproducing the package again, installing all the packages to get a consistent environment, deleting the snapshot and cleaning up after ourselves. And the package is reproducible. We can also prove that by checksumming the released package and the built package: it's the same checksum. And that's how you go about making Arch packages reproducible and helping out. So if you're interested in these things, there's the #archlinux-reproducible channel on Freenode, and you also have the #reproducible-builds channel on OFTC. We're soon starting weekly meetings again, where you can hear about the progress being made in the project and in all of the other stakeholder projects. That was my presentation. You can find me on almost any IRC network as Foxboron, and I have a web page and a blog.
I'm on GitHub, of course, and I also have Twitter. If you have questions about this talk in the future, or in general, you can send me an email or come talk to me on IRC. I hope it was an interesting talk. Thank you. Awesome. Thanks for a fantastic talk, Fox. My name's Secret, and I will be your question host. Yes. So the very first question, from last, is: why do man pages need a timestamp? So I think it's a legacy thing. It's sort of nice to know when the man pages were built in terms of the package, but it's mostly a legacy thing, and many, many build systems do not account for it. So it's a bit of an annoyance, but it's a little bit handy, I think. OK. How are these two environments set up? What are you using to keep them isolated from each other and the host, again from last? Yes. So when we build all our packages, everything is done with systemd-nspawn, which just runs the container, and we insert the base packages needed for building. Then you do the same makepkg -s process that you're familiar with from AUR helpers and stuff, and that's mostly it. So we have a root container prepared with all the base packages, and then we clone it and do the package building in the clone. So whenever you do multiple builds, they'll always start from the same root container. Yes. Right. Ebal asks, how are you sure about clean environments, package dependencies, et cetera? So I think, if I understand it correctly, that's part of the dbscripts portion of uploading. We can essentially go peek at the BUILDINFO file, and you can just read the list of installed packages. You can easily spot when there are AUR packages in the installed list, or some of your own built packages with a -git suffix, which is present in the BUILDINFO file. Then you know that this is not a clean build and is possibly polluted by some external stuff. OK, cool.
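That check, spotting suspicious entries in the BUILDINFO installed list, could be sketched like this. The heuristic (flagging -git VCS packages) and the sample entries are illustrative, not the actual dbscripts logic:

```python
def unclean_build_hints(installed):
    """Flag installed packages that look locally built: here simply
    VCS packages whose name carries a -git suffix (illustrative)."""
    flagged = []
    for entry in installed:
        # BUILDINFO entries look like name-version-release-arch,
        # so split off the last three hyphen-separated fields.
        name = entry.rsplit("-", 3)[0]
        if name.endswith("-git"):
            flagged.append(entry)
    return flagged

installed = [
    "glibc-2.32-2-x86_64",
    "pacman-git-5.2.2.r100.gabc1234-1-x86_64",
]
assert unclean_build_hints(installed) == [
    "pacman-git-5.2.2.r100.gabc1234-1-x86_64"]
```

A real check would also compare each entry against the official repositories, which catches AUR packages that don't follow the -git convention.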
Steamreg asks, how realistic is 100% reproducibility in core and extra, and what is the minimum percentage you're aiming for across all repos? So if you consider the fact that Linux, for instance, embeds signing keys for modules, we can't have 100% reproducible core. We can, however, have all expected packages reproducible, along with some packages that we expect not to be reproducible. That means that if we fix the packaging of Linux, we can split out the signing keys and have Linux reproducible, plus one Linux split package which is not going to be reproducible, but which we know about. So 100% is probably not realistic, but we can have some blacklisted packages and then, theoretically, have the rest of it reproducible to 100%. Right. And so there's no minimum percentage you're aiming for? Well, we want 100%, if we can achieve it. It's just the Linux packaging, really. Yeah, it's just a bit annoying, I think. Orhan asks, which terminal and shell is this? So the terminal is termite, which is maintained by Jelle, an Arch dev, and it was written by Daniel Micay, who was an Arch Linux TU. And the shell is zsh; I just edited the prompt to make some sense for the demos. Yes. Yeah, it's stream-specific, that's it. The next question is from me, Secret: are there security implications in making packages reproducible, i.e. losing randomization in hash maps, things like that? So I think it depends a little bit, because the naive answer to this is probably yes. You will be losing a little bit of security, because you have to split out signing keys; you maybe can't have static signing keys and stuff. But I think, overall, the security you lose is not that important in contrast to the supply chain integrity that you gain from it. And if you have some blocklisted packages, or some packages you don't expect to be reproducible, you can get away with a lot of the adverse reproducibility issues.
In terms of the Python hash seed, that was originally done to prevent denial-of-service attacks. Yeah, do you want me to read that? Because that's a follow-up question. Oh, OK, I should. Yeah. KGZ asks, the reason for the Python hash seed being randomized by default was to prevent hash collision denial-of-service attacks. Is there another mitigation for that, or was that a trade-off in favor of reproducible builds? So basically, it's actually kind of the same sphere, right? I do not recall the justification offhand for the hash seed. I think this is mainly a problem when there's a security implication to hash collisions being a thing. But for most library packages and Python applications on Arch, I don't think that's going to be a huge issue. It's mostly on your production-facing deployments where that might be a security implication. So I think there's a trade-off here between user packages and production packages, essentially. OK, cool. Katjeffel asks, sorry if I got your name wrong: when the build fails, do you chug the rest of the beer? That would be a hilarious drinking game, but not currently. I think it's the other way around: when you chug the rest of the beer, the build fails. Oh, that's going to be a hard one. Ebal asks, do these patches make any sense to provide upstream? So we actually publish a lot of these patches upstream. openSUSE has Bernhard Wiedemann, who does a lot of patching upstream to fix these issues. But it's natural that these patches get applied to the built package first; then we verify it works, and then we usually upstream the patches. That's a recurring theme in a lot of Arch packaging: we patch something, then we submit it upstream as well. So all of this hits upstream sooner or later. Some of these have been given to upstream, but they might not have been pulled into a release yet.
Can you give any examples of those that you have upstreamed? Oh, I have done probably 20 or 30 patches upstream to Go projects, a lot of the container ecosystem, Podman, containerd, a lot of those projects personally. But we have done glibc patches, Pacman has gotten patches and stuff, so there are quite a few, but I forget. The Reproducible Builds monthly report actually has a list of all the patches that are submitted and fixed upstream, so you can keep tabs on how many patches land upstream and what is currently being worked on. Awesome. Do you have any way to provide exceptions to reproducibility, for instance when you don't want to patch out timestamps? We don't. We can probably blacklist packages in our rebuilder setup, but we currently don't have any formalized way of doing that. So by blacklist, you mean it won't even attempt to try? Yeah. Our current rebuilder system uses three nodes, but none of them are super powerful. So if you try to build TensorFlow, which takes 10 hours to build, that's simply not reasonable. That package, for instance, is blocked from being rebuilt on our rebuilder system, because it would just totally kill a few servers for a day or more to try to reproduce it. And you can sort of assume it's not reproducible. Cool. That question was asked by A Boleman; I forgot to read the name. dvzrv (David) asks, comparing large container files, e.g. installation images, using diffoscope is currently an issue; it crashes. How does this currently impact large packages, e.g. game data? So I think both the Linux documentation and the Linux package are hard to debug, because diffoscope does not like to produce a diff for them. There are a lot of binary files that need a diff, and a lot of different resources. So running diffoscope on that takes quite a while and sometimes crashes. I think that's part of an optimization thing; I'm not sure what upstream does about that.
Hopefully it's getting fixed, because it's a pretty big issue on larger packages at least, but it works really well on small ones. Nice. KGZ asks, what's the end game for reproducible builds in Arch? If a package starts being reproducible, how can you block all new uploads that don't reproduce? So I think the end game is to have this integrated more deeply into the package upload process, so that if a package is not reproducible, it won't be published to users, or at the very least you can decide that you don't want the package. But considering a rolling release system with library updates, that's a bit hard. I think the end goal is to have unreproducible packages imply that you can't upload the package, but we're still a long way away from that. I guess that also introduces an issue with regards to receiving security updates: if a security update is needed for a package, doesn't it break reproducibility? You have to make the call there... That's a tough one. I'll let anthraxx think about that one for a little while. Ramzi asks, are there any guards that prevent a package that could not be reproduced from... yeah, from entering the repo; it's essentially the same thing. What happens if a previously reproducible package becomes unreproducible? Of course we want such a guard to be a thing, but we're very far away from actually having it in place. It would be very fun to have it though, and I think it would be a big improvement on the current packaging and supply chain of Arch. Stefan (0xC) asks, a few packages in Arch Linux are not built from source, but are just repackaged binaries from upstream. Do you think the packaging guidelines will change as reproducible builds become more important in the future, as anthraxx has said? So yes, that is a problem, and we do do that.
I think we do it on some Java packages, because the alternative is extremely tedious. There are probably also some legacy packages that have been doing that for years where nobody has bothered to unpack them, but that's not necessarily a reproducible builds issue. It's a general packaging quality issue, and we do fix it. There are bug reports on several of them, and a lot of that should honestly be fixed, both in terms of reproducible builds and for the general package quality of Arch packages. So there are some; hopefully we'll fix them. Please send patches. Patches welcome. Yes, patches welcome. There's a lot of Java stuff that's very tedious to get right. It's a bit bad. DonaMator asks, where can I see or get updates on whether packages I maintain are reproducible? We have an internal dashboard; I can't show you that one. I think Jelle can give you a link if he hasn't already done that, but there's an internal dashboard on Arch Web that enables you to see unreproducible packages. Currently it's internal only and the public can't see it, so I'm not showing it, but I can link you afterwards. Awesome. Sorry, I apologize if people can hear church bells in the background. And finally, the last question: you1106 asks, if one of your dependencies or your source were polluted, how would you notice? You'd build twice using the same polluted thing and get a perfect match. So I guess that's a question about generally having poisoned upstream packages. If we go a bit academic on this one, that's called a trusting trust attack, from Ken Thompson. It's about how you can trust that your compiler is actually outputting the correct binaries, and the answer is that we can't. This is partially solved by diverse double compilation, which is explained in David A.
Wheeler's academic thesis. But this is more akin to the bootstrappability issue of packaging, and not so much reproducibility. So we can't detect polluted dependencies, but hopefully, in the future, after reproducible builds, we'll have some form of bootstrappable builds, where we can bootstrap all of the dependencies and then also provide some confirmation on the package. That is being worked on, but it's admittedly a bit younger field than reproducible builds. During the Marrakesh summit in 2019, we managed to reproduce GCC, the same version across several distributions using different GCC compilers, and we wound up with the same checksum. So that's a huge improvement on the ecosystem, and I think it's a nice improvement on what we currently have. It's obviously just GCC, but if we can do that for GCC, hopefully we can then do the same for compilers in other languages. But it's still being worked on. I don't think that GCC compiler was actually reproducible in Arch, but it was across Debian, Guix, NixOS and a few others; I don't quite remember the details. Awesome. That is the end of the questions, so I guess I'm going to ask one more. What got you interested in reproducible builds? How did you get dragged into all this, Fox? It was all accidental. I met anthraxx, Jelle, Shibumi and Remy at CCC in 2016, and they introduced me to the security team. All of them also worked on reproducible builds, so they introduced me to this mind-blowing concept: if you build twice, do you get the same checksum? And you realize, no, you don't, and that just spiraled into this presentation three years later. The Arch security team has been working on reproducible builds at least as far back as winter 2016. Yes, it's mostly the same people. It's mostly anthraxx that has been the driving force, pushing this forward and getting people interested, along with Allan and others. It's been quite a few years now, actually. That's fun.
That's really cool. I know it's a subject that's fairly close to my interests, so I look forward to seeing it develop as time goes on, and I'm sure it will. I think there's another question. Oh yeah, sorry, there is one more question now. What are your hobbies other than Arch Linux and reproducible builds, by Bullsock? Do I have other hobbies than Arch Linux and reproducible builds? I love beer. I enjoy beer tasting, going to Belgium and drinking more beer, obviously. I do enjoy music and guitar; I have a lot of guitars. A lot of open source development and stuff, but most of my hobbies are free-software focused, really. I do travel and photography, though. Awesome, thank you very much. Yes, awesome. Yes. Cool. Bye bye. Thank you.