to produce the builds, and especially the unexpected benefits and problems you run into with these. But first, an introduction of me, who is that guy talking to you. I'm Bernhard Wiedemann, working for the company SUSE. We have been doing Linux distributions for quite a while, nearly 30 years, and also some other stuff. I have been involved with reproducible builds for six years, and since then I have written over 1,000 reproducibility patches to help that project, and I also contributed upstream to the project.

But why does it even matter? The problem is that our machines understand only machine code, that is, binary files, and it is really hard to review these binary files. Not impossible, but not exactly easy. So most of the time in the open source world, we review the sources of these binaries. But then comes the additional trouble that we must ensure that these sources are really what produced the binaries that the users actually use. And that is where reproducible builds come in.

So what is that? On a very basic level, we build the sources twice and check that we indeed get the same result twice. That is the basic requirement. There is some more to do after that, but when you can do that, you have sources that can be built reproducibly. When I first tell people about this concept, the reactions go in two different directions. One is: yeah, sure, computers are deterministic, so it should be trivial, what is there even to do? The other direction is: oh, we have such a complex program, with a thousand dependencies and all that stuff going in, it can't possibly become reproducible. The truth is, as so often, somewhere in the middle.

So, problems with reproducible builds: why would the binary vary? If you have worked with reproducible builds a bit, you will have encountered timestamps. A lot of people want to know when something was built, or maybe they don't actually want to know when it was built, but what version it is, and with a timestamp it is easy: ah, that is the version I compiled yesterday. Of course, if you compile a new version, it will have a new timestamp, and the binaries will vary. The hostname of the build machine: same thing.

Then you have filesystems like ext4 with its dir_index feature. The dir_index feature uses hashing with a random seed, and that means that if you list the files in a directory, they will come back in random order. That comes through the find program, through Python globs and C readdir functions, and all of these will deliver a random order, which can influence the output.

Then you can have race conditions. With make -j that is very easy: you create a file, then another process comes around and recreates the file in a different way, and depending on who is faster, you sometimes get one version or the other in the output, or different variants of that.
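To make the directory-order problem above concrete, here is a minimal C sketch (a hypothetical example, not from any particular package): the raw readdir() order depends on the filesystem's hashing, so collecting the names and sorting them is the usual way to normalize it.

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int cmp(const void *a, const void *b) {
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

int main(void) {
    char *names[4096];
    size_t n = 0;
    DIR *d = opendir(".");
    struct dirent *e;
    while (d && (e = readdir(d)) != NULL && n < 4096)
        names[n++] = strdup(e->d_name); /* raw order: filesystem-dependent */
    if (d)
        closedir(d);
    qsort(names, n, sizeof *names, cmp); /* sorted: same on every machine */
    for (size_t i = 0; i < n; i++)
        puts(names[i]);
    return 0;
}
```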
And you can even have compile-time CPU detection, which is very common in the scientific community. People say: ah, we have these AVX2 vector instructions, and of course we want to use them. So they compile with -march=native, and that means the compiler will check what CPU the build system has and use all the features available. But if you then build it elsewhere, with a different set of CPU features, it will look different; the binary can't be reproduced. So that is also not uncommon, and this one can even lead to bugs. When you have a build farm with different machines, and users run older machines, then sometimes when you build on your newer machines, you get instructions that are not supported on the older machines, and you will get an "illegal instruction" signal, and the program will terminate on these older machines. Not good. We patched it out: try to always use the same instructions. It is fine to even always use AVX, but then at least you know that it will work the same every time.

But then we also have surprising problems that I think nobody really expected, and that is where the talk gets interesting, because some packages you can't easily make reproducible, because they vary due to profile-guided optimization. So what is that? That is for when you want to squeeze out the last bits of performance. To do that, you build your program once with profiling options enabled, which adds some extra code into the binaries. Then you do your profiling run. That will produce some output, .gcda files containing counters of which branches were taken how often. And after that, you compile your program a second time using that extra output, and that helps the optimizer to put the right branches into the fast path, while the exceptions get put into the slow path. But of course, if your profiling run varies even a tiny bit, the optimizer will produce different output, and that means: not reproducible. So you need to make sure that the profiling run is really the same, and it can be really sensitive.

I remember when I looked into the gzip package: I gave it the same tarball as profiling input, and it still produced variations, because the tarball was under /tmp with a random mkstemp file name. And then we still got variations, because there is a tolower function in gzip, and that tolower function sometimes saw more uppercase characters and sometimes fewer. And that meant GCC produced different optimizations. I really had to feed it that tarball over standard input, so that it couldn't see the random file name anymore, and then it became reproducible. That is really how sensitive it is.

And then we have other packages like GCC and Python, which have a really large, comprehensive profiling run. For example, GCC as its profiling run builds all of GCC, so it has really exercised all of the different code paths. But it is very hard to make that reproducible, so at the moment GCC is only reproducible if you disable profiling. That is easy, but you lose something like 8% performance; if you want reproducible results, that is a tradeoff you have there.
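To make the PGO workflow concrete, here is a minimal C sketch; the gcc flags -fprofile-generate and -fprofile-use are the real ones, while the program and the training-input file name are made up. If the training input differs between two builds, the branch counters in the .gcda file differ, and the second compile emits different code.

```c
/* Build and run roughly like this:
 *   gcc -O2 -fprofile-generate pgo.c -o pgo   # 1: instrumented build
 *   ./pgo < training-input                    # 2: writes pgo.gcda counters
 *   gcc -O2 -fprofile-use pgo.c -o pgo        # 3: optimized rebuild
 */
#include <ctype.h>
#include <stdio.h>

int main(void) {
    int c, upper = 0;
    while ((c = getchar()) != EOF)
        if (isupper(c)) /* like gzip's tolower case: the counter for this  */
            upper++;    /* branch depends on what the training run saw,    */
                        /* e.g. a random /tmp file name instead of stdin   */
    printf("%d uppercase characters\n", upper);
    return 0;
}
```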
The other tradeoff is sometimes with security. That is also a bit surprising, because we do reproducible builds to get better security. But there was, for example, the libcamera package, and libcamera uses a random GPG key created during compile time. That key is used to sign modules, so that later, at runtime, it can see which modules were really compiled as part of the main build, and it gives them extra permissions, because it knows it can trust them. And if someone tries to sneak in extra modules later, it will see: oh no, they are not signed by the random private key. So what solutions exist for that problem? We could add the private key to the sources, declare it part of the sources, and then we could use the same private key and make reproducible signatures. But that somehow defeats the purpose of signatures, because then everyone can use the private key to sign extra modules as well. Not good.

Or there is this other package called shim. That is a very small bootloader, and we build it once, then we give it to a third party and let them sign it, and then we put the signatures into the sources. Because the binaries are reproducible, we can then at the end of the build just add the signatures next to them, and they match, because it is the same binary. That works for the shim package because we change it very rarely, but on the other hand it defeats the purpose of sources a bit, because then you can't just change the sources and add small patches to fix issues. So also not that nice; those are the possible trade-offs you have there. Or we say: it is not reproducible, sorry, and just ship unreproducible binary signatures. Maybe it is even possible to have a special program that strips off the signatures, and then you can prove it is still the same binary after the stripping.

Next up, let's look at what surprising benefits we have. When you have reproducible builds, you can build a binary twice and get the same binaries. Then you can also use diverse double compilation as a counter to the trusting-trust attack. So what is a trusting-trust attack even? It is when you have a compiler and there is a backdoor in the compiler, and that backdoor means that when you compile the compiler again from source, it will re-add the backdoor into it. It is hard to get the backdoor out, and hard to even see it, because the backdoor is not in the source, it is in the compiler you use to build the source. When you use diverse double compilation, you can use two, three, four compilers, have all of them build the same target compiler, and then use those target compilers to build the target compiler again. And that should lead to identical results, because the intermediate compilers should be functionally identical. We even did that three years ago in a small project called ddc-poc, as a proof of concept for diverse double compilation. We used TinyCC, because that compiles in just 20 seconds, so it is fast, easy, and self-contained. It is not perfect, but it shows that it is possible.

Another benefit is that it can reduce load on the build service, because the build service tracks dependencies. So if you have a change in component A, some library, and maybe it was not a big change, we just changed a README or a comment in the source code or something, then the build service doesn't know that this doesn't affect other things, so it rebuilds the other things. But it will see that these other things that depended on the library didn't actually change, so it doesn't need to republish them, doesn't need to waste bandwidth pushing them to the mirrors, and users don't need to pull a new version from the mirrors. So it saves a lot of load on all these components. That's nice.

And when you have reproducible builds, you can even find bugs that corrupt data at compile time. There are a few listed on en.opensuse.org, and there was even a collection of such bugs on the mailing list that I shared there. The fun one was the bash one: we found that the documentation had a wrong string that should have read "bash" but actually came out garbled, and that turned out to come from a strcpy on overlapping regions, where the documentation says: no, you shouldn't use strcpy on overlapping regions. And there was even a comment in the code, and the comment said: yeah, we should use memmove.
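That bug class is easy to demonstrate. This is a hypothetical reconstruction, not the actual bash code: strcpy() with overlapping source and destination is undefined behavior, and memmove() is the well-defined replacement that the comment was asking for.

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    char buf[] = "XXbash";
    /* Undefined behavior: source and destination overlap, so depending on
     * how the libc copies, the result can come out garbled:
     *   strcpy(buf, buf + 2);
     * The well-defined fix, shifting the string to the front: */
    memmove(buf, buf + 2, strlen(buf + 2) + 1);
    puts(buf); /* reliably prints "bash" */
    return 0;
}
```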
So there was even a patch. These are the kinds of bugs you can find and fix. And there are even more things you can benefit from; I collected them some years ago on the buy-in page of the Reproducible Builds project, so if you are interested, you can look in there.

Next up, we want to look at how to debug. For that, I have written a whole document called "how to debug". It is a bit specific to openSUSE, but the general steps are the same, independent of which distribution or which set of tools you use. First, you set up the tools and use them to check: is there even an issue? If there is an issue, then you go in and debug to find the source of the issue. And once you have found the source, you can fix the issue, and in the end, you submit the fixes.

When we look more closely: I wrote a tool to do a rebuild in KVM, and that can apply a custom level of variations. It can vary the filesystem read order or not, it can build on the same day or 60 years apart, things like that, also with address space layout randomization. And there is another tool called nachbau, which does a replication build: it takes the official binary build and tries to reproduce it as closely as possible. It should produce similar binaries, except that our official binaries don't normalize the mtimes currently, so we don't get bit-reproducible binaries so far, but close.

And once you know there is an issue, you go in and debug the issue. For that, I have a tool called autoclassify that builds binaries while turning individual variations off: it flips bits of the variation set to zero, more and more, until at some point the build becomes reproducible or unreproducible again. Then it leaves that one bit set and turns the other bits to zero, and in the end you see: oh, it is bit number seven, that is parallelism, maybe. And then you look for issues with parallelism and how that can influence the result.

A tool that comes in handy there is called autoprovenance, which uses strace output: it builds your program under strace and sees which files get written by which program and which forks happened. That more or less gives you a call trace, so you can see: OK, this make -j called this program and that one, and then it called GCC again to produce the output file again, and then you can see, OK, that is a race condition there, and fix it.

And there are a lot of other ways to find the problem. For example, after the build I create a diff of the build root, and in that diff you can look for .c and .h files that differ, or even configure output and Makefiles that were created as part of the build. Once you see these differences, it can get really straightforward to find the root cause of the issue. Sometimes you can just look at the build log, especially if you have verbose options enabled. You do that and see: OK, this one build did that step and the other did not, and then you go in and check why that difference was there. Or sometimes it is an ordering issue, and you can see these things in the build log if you look the right way. Or maybe you see a string in the output, and it has a date in it, and next to the date it says "this file was autogenerated at"; then you grep the sources for that, or you just grep for %Y, the year format, because the output has a year in it, or for other typical strings that appear, like the Python glob function or the os.walk or listdir functions.
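Here is a small hypothetical C sketch of that date case, together with the standard fix that comes up in the next step: the first function embeds the wall-clock build time via __DATE__ and __TIME__, which are exactly the kinds of strings worth grepping for, and the second honors the SOURCE_DATE_EPOCH environment variable specified by the Reproducible Builds project, formatted in UTC so the timezone doesn't leak in either.

```c
/* Hypothetical example of the "date in the output" pattern and its fix. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* The problem: embeds the wall-clock build time, so every build differs.
 * __DATE__, __TIME__, "%Y", ctime() are good strings to grep sources for. */
static void print_build_date_bad(void) {
    printf("this file was autogenerated at %s %s\n", __DATE__, __TIME__);
}

/* The fix: honor SOURCE_DATE_EPOCH (seconds since the epoch, from the
 * environment) and only fall back to the current time if it is unset. */
static void print_build_date_good(void) {
    const char *sde = getenv("SOURCE_DATE_EPOCH");
    time_t t = sde ? (time_t)strtoll(sde, NULL, 10) : time(NULL);
    char buf[64];
    strftime(buf, sizeof buf, "%Y-%m-%d", gmtime(&t)); /* UTC, not local */
    printf("this file was generated at %s\n", buf);
}

int main(void) {
    print_build_date_bad();
    print_build_date_good();
    return 0;
}
```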
Such patterns are very common, so depending on the type of issue you see, you can look for different things there. You can even run strace manually, but of course that produces a lot of output and is not exactly easy to read, so that is the fallback if the other tools don't help.

And once you have found the issue, you go on to fixing the issue, and that can be really simple, like with date issues: we have this thing called SOURCE_DATE_EPOCH, so you can patch the code to use that one, or even better, you just omit the date; it just depends on how good you are at convincing the upstream to do that. That would be dropping the varying element: you can drop the hostname of the build machine, drop the user that built it; some projects even capture what the build CPU was, and ideally that shouldn't matter. So if you can drop it, drop it; that is sometimes a very good approach, because it reduces the complexity upstream. Or you normalize, like with SOURCE_DATE_EPOCH. Or you sort lists that were unreproducible, for example ones derived from hashes: sometimes you have hashes coming out in random order, and you sort them where they are used. Or you have actual bugs like uninitialized memory: you just initialize your uninitialized memory.

So in the end, you have a patch. What do you do with the patch? If there is an upstream and it is active, then you get it upstreamed, and that works in something like 50% of cases. Sometimes there is no active upstream; then your distribution has to carry the patch downstream in its packages. And maybe there are other issues: the upstream disagrees on the solution, so you have to go back and maybe make a nicer patch, and all of this costs more time.

So at some point you have a patch and you submit the patch. There are different ways, with GitHub, GitLab; they are all different. Some projects even use arc. There is more: Mercurial, that is very rare, but used in a few places. And some projects have mailing lists, or even just individual authors; that is very common in the scientific community, and then you just send the patch to the main author. And if they don't react, maybe you find other people who know how to get patches in there, and git log is really useful there to tell you whom to contact.

And finally, what good is a patch if you can't tell people about it? So you add it to the reproducible builds reports, and people can look at the patches and maybe even improve them or comment on them, and you get your small bit of fame there. And sometimes it takes longer for patches to get merged, so it is really useful to have a list of them, so you can revisit them later and see if there is still something to do for you, like adding the Signed-off-by lines that are missing, or agreeing to contributor license agreements. And sometimes people don't react; then you ping them again later, because things can fall through the cracks and people don't always have time, so it is good to remind them once or twice later. And sometimes that is just not the right path; then you reach people through different ways, email, or use IRC, and find other contacts, and somehow there will be a way to get your patches in.

So that is it for patches, and mostly for the main part of the talk, and this would be the end of the video, opening up to questions and answers. What are your questions? And thanks for listening so far. If you are on the virtual platform, feel free to submit your questions there; it is via text only.