in material after the initial build and use them to debug the software. One other thing that we want for Debian is to be sure that we can actually build packages from source. That's mandatory. And so if we say that packages need to be reproducible, to have reproducible builds, then we can make sure that is actually happening. Multiarch requires that packages, if they contain files with the same name, those files have to be bitwise identical. So by making a package's build system deterministic, we also help the multiarch crowd. There's also the idea that, from one version of a piece of software to another, if the build system is deterministic, maybe the .deb will not have changed that much. And so between one version and the next, the packages might be pretty similar, and we might get smaller deltas, faster upgrades, and less bandwidth usage. Another user of deterministic builds would be the build profiles mechanism that is getting in. Build profiles are helpful for bootstrapping new architectures; that's one of the underlying reasons. And the idea with build profiles is that if profile A builds a set of packages, and profile B builds a subset of those packages, that subset also needs to be feature-identical to the ones from the other build profile. And with reproducible builds, we can make sure that they are actually feature-wise identical, because they would be bitwise identical. And maybe we can find other reasons, I don't know; there are many interesting aspects. So how did this start for me? I'm active in the Tor Project. And last year, in the spring, Mike Perry and other people working on the Tor Browser worked on making the Tor Browser build process deterministic, for one of the very reasons I explained before. That process was also partly inspired by the Bitcoin crowd, because they are shipping binaries that handle people's valuable assets.
And so they wanted to be sure, when they hand out a binary to someone, that they are not stealing that person's money, or whatever this cryptocurrency thing is. So we have that for the Tor Browser now. And it's been very interesting to see how it changed some of the development processes. For example, once someone in the team tags a build, other people in the team will also start running the build. But only one person will actually upload the gigabytes of the multiple versions of the packages to the server. Everybody else only uploads a single signature of the sha256sums file, the checksums of the browser bundles. So you can have one system that you do not trust that much, but that has a very fast internet connection, uploading the data to a server somewhere. And then on your own laptop you redo the build. Maybe you're offline, you're on a train, whatever. Once the build is finished, you only have to compare the checksums. And if the checksums match, you know that it's OK. And if multiple people do the same process, you have far more trust that there has been no compromise in the process, because, well, we all got the same results, and it's highly unlikely that everybody got compromised at once. But this is not a new idea. After I started on this last year, someone, Martin Uecker, wrote to me and pointed me at an email from 2007, so seven years ago, on debian-devel, saying: I think it would be really cool if Debian policy required that packages could be rebuilt identically from source; at the moment, it's impossible to independently verify the integrity of binary packages. Well, that was seven years ago. The reactions were not super enthusiastic, probably also because Martin was not part of the established Debian crowd. For example, Neil Williams, who's in the room, no?, said: why? I've yet to see a benefit. Well, I hope you will see benefits.
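The Tor Browser workflow described above can be sketched in a few lines of shell. This is a minimal illustration, not the actual Tor Browser tooling; the directory and file names are invented, and in the real process the checksum files are additionally GPG-signed before being published.

```shell
# Several people build independently from the same tagged source; only the
# small checksum file travels, never the gigabytes of binaries.
set -e
dir=$(mktemp -d); cd "$dir"
mkdir builder1 builder2

# Pretend both builders produced the same bundle from the same tag
printf 'browser bundle contents' > builder1/bundle.tar
printf 'browser bundle contents' > builder2/bundle.tar

# Each builder publishes only a (small, signable) checksum file
( cd builder1 && sha256sum bundle.tar > sha256sums.txt )
( cd builder2 && sha256sum bundle.tar > sha256sums.txt )

# Anyone can cross-check: matching checksum files mean matching builds
if cmp -s builder1/sha256sums.txt builder2/sha256sums.txt; then
  echo "builds match"
fi
```

A second builder can also verify directly against the first builder's published file with `sha256sum -c`, which is how a slow laptop on a train can validate the upload made from the fast untrusted machine.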
And Manoj, who's not in the room either, but still, said: well, I think this is technically infeasible. But hey, I'll be happy to be proven wrong. So let's prove him wrong. What happened is, after the example of the Tor Browser, I scheduled a really last-minute BoF during DebConf13 in Switzerland. And I was very surprised because something like 30 people showed up, and we had an interesting hour-long, well, 45-minute discussion, which was short, but still. And that kicked off the wiki page, ReproducibleBuilds. The wiki page tries to gather a lot of information, so it's a long page; if you want to help make it better, please do. But mainly: how do you do reproducible builds? What are the steps that we need for that? It's pretty simple, actually. One: you record the build environment, so you know what tools you used to build a specific package. Then you need a way to reproduce that build environment: when you want to reproduce the build, you start by setting up the same environment as the initial one. And then you need to eliminate all the unneeded variations that are part of the build process. Recording is actually fairly simple for Debian: we have packages, they have versions. So if we install the very same versions of the packages that were installed initially, we are very likely to be in the same build environment. Reproducing the build environment is also fairly easy for Debian, because we have snapshot.debian.org, and snapshot saves every binary version of every package that has ever entered the archive. So you can take packages from there, and you get an environment that is very close to the initial one. Then there are all these variations that get captured by build systems but that don't need to be captured to get the final software. Timestamps: everywhere, timestamps.
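The "record the build environment" step can be sketched as below. The package lists are simulated with `printf` so the sketch runs anywhere; on a real Debian system you would generate them with something like `dpkg-query -W` instead, and the version numbers shown are invented.

```shell
# Record which package versions each build used, then compare. Sorting
# makes the comparison independent of the order packages were listed in.
set -e
dir=$(mktemp -d); cd "$dir"

# Two builds record their installed packages (listing order may differ)
printf 'gcc 4.9.1\ndpkg 1.17.13\nmake 4.0\n' > builder1.list
printf 'make 4.0\ngcc 4.9.1\ndpkg 1.17.13\n' > builder2.list

sort builder1.list > builder1.sorted
sort builder2.list > builder2.sorted

# Identical sorted lists mean the same package versions were installed
if diff -q builder1.sorted builder2.sorted > /dev/null; then
  echo "environments match"
fi
```

The point of the sketch is only the shape of the data: a sorted name/version list is enough to later pull exactly those versions back out of snapshot.debian.org.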
So you create a file, and the time of its creation gets recorded. That's what dpkg does, for example. So let me tell you, I'm willing to lead collective timestamps-anonymous sessions, like a support group, if you want. Timestamps in the build process are not useful information to capture. Please trust me. What is interesting is, for example, which source you used, the environment that you used; the timestamp of the last git commit, maybe, but the timestamp of the build, no. If I take an old version and build it now, there is no interest in capturing that information, really. It's not part of the software. Support group, I can do a support group. The other piece of information that gets captured is the build path. And that's really annoying: the path where you actually typed the make command gets into the final binary. File order might get captured: if you concatenate files, then depending on when they were written to the filesystem, you get different results in the final build. I don't know. Yeah, Colin? It's worth noting on build paths that those are occasionally security holes as well. Occasionally we find that paths in developers' home directories are hard-coded into packages, and the package will care whether that thing under /home exists, so that's worth nuking for other reasons. Russ, you want to add something? One comment on the timestamps: the one place where I do know that it is used, is by GDB when you're actually debugging a binary, so that it can tell you that the source file is newer than whatever went into the binary. I suspect there's some better way of storing that timestamp than the build timestamp, but it is used there, I think. I'm coming to the debug symbols; that's my major headache. And locale, for example, also gets captured.
For example, the GNU sort command-line utility will sort files differently depending on your locale. To give you a couple of examples: gzip, by default, will store a timestamp. Yay, super useful information. ar, tar, zip: they all store timestamps. And for Debian, most of the time the stored timestamp is pretty useless, because it is the time of the build: you've just created a new file with GCC and put it in an archive; that timestamp is not really interesting. Javadoc writes timestamps in the HTML files. Why? So, the major headache: you have ELF binaries, and you have DWARF, which is the debug symbols. And in DWARF, the build path of the source code gets captured, which is actually annoying, because it means that if you install the -dbg package in Debian, the source path is not right; you would have to fix all of it. I'm coming back to that later. Go for it. So we currently do builds inside of fakeroot. Perhaps we should start doing builds inside faketime or similar, and a lot of these variations go away? Coming to that. For file order, we have readdir, which returns files in the order of the filesystem. For locale: that's why I told you that sort varies between French and C, if you have accented letters, but there are also other examples. Other information that might get captured: the hostname of the system, super useful in a built binary, I told you; the uname output, well, for some cases, but really, I'm not sure; the username of the builder, no. So we could cheat. That's the way the Tor Browser did it: they use a VM. That's the Gitian thing; the Bitcoin people also use it. They use a VM, and everybody who's building the thing gets the same virtual machine image. So they have the same kernel, they have the same user, they have the same build path, because it's basically the same image they share. And they also use tools like libfaketime, which fixes the time.
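The locale pitfall with GNU sort, and its usual fix, can be shown in one pipeline. This is a generic illustration rather than a snippet from any Debian tool: pinning `LC_ALL=C` makes sort compare raw bytes, so the output no longer depends on whichever locale the builder happens to use.

```shell
# In the C locale, uppercase ASCII sorts before lowercase, and the result
# is the same on every machine regardless of the user's configured locale.
set -e
printf 'beta\nAlpha\nalpha\n' | LC_ALL=C sort
# byte order: 'A' (0x41) < 'a' (0x61) < 'b' (0x62)
```

A build system that sorts file lists this way produces the same ordering under a French locale, an English locale, or no locale at all.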
So that's one way to do it. The other way, which I think is more correct for Debian, because we don't want to graft our own little solutions onto everything, and because it would also be great if every free software distribution had reproducible builds, is to fix the bugs. If a build process captures data that is not useful for the resulting software, then I consider that a bug in the build system. So for example, we can configure the toolchain. Binutils has a ./configure option called --enable-deterministic-archives, which makes the ar command not record user IDs and timestamps. We could make that the default. We need to patch software like Javadoc so it gets a no-timestamps option. And we can individually patch build systems: for example, gzip has the -n option, which will not store the name of the file and the timestamp. Paul? Just looking at this stuff, it makes me think of upstreams, and that makes me think of the upstream guide and the BoF that's coming up later in the week. So we could add some advice for upstreams. And do you think the proper place to fix this is the build systems themselves, like Autoconf, CMake, or is it in the separate configuration for those? So for example, binutils is toolchain, so it's our responsibility. Gzip is often called in debian/rules, where we do the main patching. What about Javadoc? I was thinking mostly of that one. Well, we could try to make Javadoc not store timestamps at all by default, but that's a fight with the Javadoc upstream. That's a fight we should pick, though. But also, more and more upstreams are getting interested in making their build systems deterministic. I know that it's not much, but the HTTPS Everywhere extension for Firefox: its build is reproducible.
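The gzip `-n` fix mentioned above is easy to demonstrate. This sketch compresses the same content twice with different file mtimes: without `-n` the mtime lands in the gzip header and the outputs differ; with `-n` they are bitwise identical.

```shell
# Show why reproducible-builds patches add "-n" to gzip invocations.
set -e
dir=$(mktemp -d); cd "$dir"
printf 'same content' > f

touch -t 202001010000 f
gzip -c f > a.gz            # default: stores f's mtime in the header
touch -t 202101010000 f
gzip -c f > b.gz
cmp -s a.gz b.gz || echo "default gzip output differs"

touch -t 202001010000 f
gzip -nc f > c.gz           # -n: store neither name nor timestamp
touch -t 202101010000 f
gzip -nc f > d.gz
cmp -s c.gz d.gz && echo "-n output is identical"
```

The same single-flag pattern applies to the other archivers mentioned: ar has its deterministic mode, and GNU tar can be told which mtime to record.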
But that's not very useful for Debian yet, because the Debian package itself cannot be reproduced. So, to toy with these ideas, and thinking also about what Manoj said, we ran an experiment, building and rebuilding many source packages. We used EC2 VMs; I'm not clear on the details, but David Suarez made all the magic happen, and many thanks to him. The idea of the experiment is that we build each package twice. To do so, we set up a clean chroot, unpack the source code, install the build dependencies, build, and then do the exact same thing again. One slight difference is that we pass the timestamp of the first build to dpkg through an environment variable. So in the context of the experiment, we have two variations. The time of the build is different, because there is no libfaketime or such. And the build path is different, because the build is done with sbuild, which picks a random path for every build. But there are no variations in hostname, or username, or uname, or, I believe, file order, because we unpack the package the same way both times, or locale, which is, I think, C in both cases. Still, it is a pretty good framework to start evaluating. And for the second experiment, which we did in January, the changes from a normal Debian build environment were these. There's a patch for dpkg that uses a single timestamp for the whole archive: the .debs are made of tar files, and those tar files carry timestamps, so we can pass dpkg an environment variable that makes it write that exact timestamp throughout the archive. There is also sorting of the files that dpkg puts in the archive, so we always get the same file order. And to fix the build path problem in the DWARF files, we used a tool made by Red Hat called debugedit, which can change the paths in the written DWARF files after the build.
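What the patched dpkg does for the tar members of a .deb — one fixed timestamp plus a sorted member order — can be imitated directly with GNU tar. This is a sketch, not the dpkg patch itself: it assumes GNU tar with `--sort` (available since tar 1.28), and the fixed epoch value is made up.

```shell
# Build the same tree into a tar archive twice, with fresh file mtimes the
# second time. Fixed --mtime and --sort=name make the outputs identical.
set -e
dir=$(mktemp -d); cd "$dir"
mkdir pkg
printf 'one' > pkg/b.txt
printf 'two' > pkg/a.txt

build_member() {
  tar --sort=name --mtime='@1410000000' \
      --owner=0 --group=0 --numeric-owner \
      -cf "$1" pkg
}

build_member first.tar
touch pkg/a.txt pkg/b.txt      # simulate a later rebuild: new mtimes
build_member second.tar

cmp -s first.tar second.tar && echo "bitwise identical"
```

Pinning owner and group to 0 removes the builder's username and uid, two more of the unneeded variations listed earlier.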
And we hooked that into dh_strip, which is also what produces most of the -dbg packages, so that felt like a good place to put it. But there's a trick: we need to pass -fno-merge-debug-strings to GCC, because otherwise debugedit can't work well; hash table order gets in the way, details, painful, blah. And binutils has been rebuilt with --enable-deterministic-archives. So that's the experiment we did. We built 5,151 source packages, and for 3,196 of them we produced identical binary packages. That's 62%. Is there a rationale for the set of source packages you chose? Because I went to look at it to see what the state of mine was, and almost none of them were there. No, they were picked at random by David's systems. So it is not actually super representative, but it does say that this is not a crazy idea: we can do this. Maybe getting to 100% is a crazy idea, or it will take several decades, I don't know. But at least getting to 80% or 90% doesn't seem an unreachable goal. To give you a couple of examples: with that setup, findutils was reproducible; wget, calypso, busybox, python-support, that's a couple of packages that were working well. Among the failures we identified in the remaining packages, the top one is DWARF files still having a mismatched build ID between the two builds, which is probably because of the build path: either the package is not calling dh_strip, or the build path is encoded in a way that the trick we used didn't handle. Then JAR files; a problem with Haskell files; a problem with the PHP registry capturing the build path; gz timestamps; Mono, there's something going on with Mono, I don't know; docbook-to-man, a timestamp in there. There are also some other low-hanging fruits that could benefit a lot of packages at once.
But right now we still have no good solution for the build ID thing, the DWARF. So one idea, which Steve and John came up with on the reproducible-builds mailing list, is: OK, let's stop having dh_strip do weird things; how about we agree on a canonical build path? That would solve the problem Colin mentioned, about random paths getting into the files. And for GDB we would have a canonical location, so it would be easier for Debian users: they would just have to get the source into the right location, and when they run GDB, the right path is already set up for them. And there's a tool called proot that can fake the current directory of a process. So you could build the software in whatever directory you want, run dpkg-buildpackage, and it would, in the background, like fakeroot does, fake the current directory so you get the canonical one. Or we could change sbuild and pbuilder so they use this canonical directory every time. Proot has a downside in that it uses ptrace, so it's not available on all architectures; that might be a problem, I don't know. And it's unclear how we push changes like that. If there are super old-timers of Debian who can tell me how we get to decide on a canonical build path, please don't answer me with "do a GR". But that would be an idea to solve that particular class of problems all at once. One other idea, which might be contentious, would be to have dpkg-buildpackage export the environment variable GZIP, which holds options for gzip, and pass the -n option by default. Maintainers were not really happy when dpkg-buildpackage started to push CFLAGS and friends, so maybe this is contentious, but it would fix a lot of packages at once. I don't know. It seems like that would fall afoul of a number of upstream build systems that go out of their way to not be affected by environment variables like GZIP that you may have set as a user.
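What exporting the GZIP variable from dpkg-buildpackage would amount to can be tried by hand: gzip appends the variable's options to every invocation, so `-n` takes effect without touching any debian/rules file. Note that newer gzip releases print a deprecation warning for this variable (it is silenced below) while still honoring it.

```shell
# Same content, different mtimes, but GZIP=-n in the environment makes
# every gzip call drop the name and timestamp, so the outputs match.
set -e
dir=$(mktemp -d); cd "$dir"
printf 'payload' > f
export GZIP=-n

touch -t 202001010000 f
gzip -c f > a.gz 2>/dev/null
touch -t 202101010000 f
gzip -c f > b.gz 2>/dev/null

cmp -s a.gz b.gz && echo "reproducible without touching debian/rules"
```

This also illustrates the objection raised in the room: any upstream build system that scrubs its environment before calling gzip would silently lose the fix.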
Well, then, we can fix those individually, but we could fix a whole lot of packages at once if we did that. I don't know. Russ, you had a question? Is it possible to strip the timestamps out of gzip-compressed files, even if they were originally compressed without -n, as a post-processing step? Because if so, you could just throw something into debhelper that goes through all the compressed files in the package and removes the timestamps. Good idea. I don't know if there is such a tool, but it might be easy to write, actually. So yes, we could do that. OK. Oh. Interesting. I had an answer for the default build path that is not a GR: /usr/src should be read-only according to the policies I could read, and pbuilder throws stuff in /var/cache, so maybe that could be used instead. And I had a question regarding the timestamps: when you fake a timestamp, which time do you choose? So, you do a first build, and you capture the build environment, and you capture the time of that build. And that's what we use as a reference point when we need to write the .deb of the second build. The time of the start of the build or the end? It's using the time of the start of the build right now. But the point is, libfaketime has become really powerful: you can actually record all calls to gettimeofday and have it replay them. But I don't think that's the way we should do it in Debian. I mean, why it is done that way is actually to please Guillem Jover from dpkg, because in the initial discussions he said: I want to keep timestamps. So maybe, I don't know, maybe we have to go to the technical committee to say that it's not useful to have timestamps in the .deb archives at all. I'd like to. Yeah? Yeah, a question on that. So that's good enough for the use case of us testing that packages are reproducible.
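Russ's post-processing idea is indeed easy to sketch: per RFC 1952 the gzip MTIME field occupies bytes 4-7 of the header, and the header is not covered by the CRC, so zeroing those bytes removes the embedded timestamp without recompressing. The helper name below is invented for illustration; this is not an existing debhelper tool.

```shell
# Strip the embedded mtime from gzip files after the fact, as a debhelper
# step could, making two differently-timed compressions identical.
set -e
dir=$(mktemp -d); cd "$dir"
printf 'document text' > f

touch -t 202001010000 f
gzip -c f > a.gz
touch -t 202101010000 f
gzip -c f > b.gz

strip_gzip_mtime() {   # hypothetical helper: zero header bytes 4-7
  printf '\0\0\0\0' | dd of="$1" bs=1 seek=4 count=4 conv=notrunc 2>/dev/null
}
strip_gzip_mtime a.gz
strip_gzip_mtime b.gz

cmp -s a.gz b.gz && echo "timestamps stripped, files identical"
gzip -dc a.gz          # the payload still decompresses fine
```

An MTIME of zero is explicitly defined by the format as "no time stamp available", so every decompressor accepts the result.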
But I think the other use case is enabling users to reproduce the exact packages which are in the archive, to ensure that we are not doing anything fishy. So how do you imagine that? We produce the packages, and there is a way for users to reproduce and verify them? With pleasure. I have a couple of crazy ideas, but can I come back to that? OK, cool. So, just to finish: I wanted to make a new experiment using a dpkg-buildpackage that would call proot, but I was too busy, so I couldn't; I'm really sorry. I would have expected something like 80%, maybe 90%, I don't know. There are other distributions interested in this. There's Fedora, where one of the security people wrote a blog post about this after we started the Debian initiative; to the best of my knowledge, there has been no further progress there. OpenSUSE has something interesting called build-compare, but I haven't had time to look at it much. And the distro called NixOS is super interested in reproducible builds, because they have an interesting system where they capture the environment that was used to build a specific package, as well as all of its dependencies. And if they get reproducible builds, it means that when you upgrade a specific package, you have to rebuild far less: if a build dependency has changed, but hasn't changed its behavior, the resulting checksum stays the same. So there could be interesting cross-collaboration. But so far, my impression is that the main efforts on this have been in the Debian crowd, or friends. So, before I get to "want to help": crazy ideas. First, there's this idea that we should not directly take binary packages that were built on a developer's system and put them in the archive. We're getting there, slowly; Ansgar worked on dak to move forward on this. What I'm advocating is that we still make developers build the binaries, but then we drop the .deb.
So the .changes file contains a checksum for the .deb, and you upload everything to the upload queue except the .deb. And then you have a buildd that picks up that .changes and all the relevant files, and rebuilds the packages from source in the same environment. And if the .deb that it produces matches the original checksum, then it gets into the archive. That idea means sparing our internet connections, which would be good. But it also means that you get reasonable trust that the packages are fine, unless your system and the buildd are compromised at the same time. That's better, yeah. Of course, you do have a problem with multiarch and different architectures with that, because most of your developers only have one architecture to play with, or two. So you only get faith that the architecture your developer uploaded for hasn't been compromised, and not the other architectures. Yeah, sure, but we have general trust that the buildds are not easy systems to compromise, or we try our best to make that the case. We can also have multiple buildds in multiple geographical locations and cross-check those. In addition to just having multiple buildds, you could also have the standard buildd system work as it is now, and then other people who want to run reproducing buildds can still pull the same things from snapshot and attempt to reproduce the builds and verify that for themselves, as they wish. So my main focus is the .changes file. I want .changes to be the basis for reproducing a given build. Then we could go even crazier: currently, .changes files are signed by only one developer. We could have .changes files that are signed by multiple keys, so we can spread the trust and have even more trust. So that's an idea. So, if you want to help: one of the things I would really like is to run this experiment that I haven't had the time for.
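The upload-without-the-.deb idea can be shown in miniature. The manifest below plays the role of the checksums field of a .changes file; the package name and payload are invented for illustration, and a real dak/buildd integration would of course involve signatures and a full source rebuild.

```shell
# Developer uploads only a checksum manifest; the buildd rebuilds and
# accepts the .deb into the archive only if its checksum matches.
set -e
dir=$(mktemp -d); cd "$dir"
mkdir upload buildd

# Developer side: build locally, upload only the manifest
printf 'deb payload' > foo_1.0_amd64.deb
sha256sum foo_1.0_amd64.deb > upload/manifest

# Buildd side: rebuild from source in the same environment...
printf 'deb payload' > buildd/foo_1.0_amd64.deb
# ...then compare against the developer's manifest
( cd buildd && sha256sum -c ../upload/manifest > /dev/null ) \
  && echo "checksum matches: accepted into archive"
```

If the buildd's rebuild produced even one different byte, `sha256sum -c` would fail and the upload would be rejected, which is exactly the compromise-detection property described above.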
With every large-scale build experiment, what takes time is also sorting through the results. So if you want to help: we have a nice tool called debbindiff, which takes two .changes files, runs a lot of crazy tools, and tells you the differences by diffing many file formats inside the .debs. So that's one thing, and then people can help triage bugs. Yeah, Paul? Just quickly, on the .changes with multiple signers: that makes me think you would need detached signatures for the .changes file. You don't, actually; you can have multiple signatures in the same block. So you then modify the .changes file when someone else signs it? Anyway, let's move on; we can find a solution. My point is that we need to put every single piece of information that is useful to reproduce a build into the .changes file. Maybe that's not the best way to do it; there's research to be done there. So that's one thing, and then we need someone to specify something. We could record information and then replay it later, but that effectively means that if you build in the absence of that build log information, you get a different build. I think it would be more valuable if we constructed a standard environment, so that if two completely independent builds occur, they will get the same result without any logged information that needs replaying. And that doesn't seem infeasible for most of the problems that you've mentioned. Yeah, there are trade-offs here that might not be super easy; ten minutes left. Any such standard environment would be constantly evolving, because your build dependencies are constantly moving under you, but there are various tools that exist today, dh_buildinfo is one such, that do log this kind of thing, and maybe we should just standardize on one of those. Yep. So, not sure if this is going to derail your talk, so I'm sorry.
Have you seen anything where a build produces multiple versions of the same binary, then tests to see which one performs best and ends up using that one? I seem to remember there was something with Firefox, maybe, that did that. What is it called? PGO. PGO? Profile-guided optimization. Oh yeah, right, exactly. Have you seen that, and are there techniques to avoid it? No, not yet. Well, I know that Firefox can be made reproducible, because the Tor Browser is based on Firefox, and there was a need to stomp on a few bytes, zeroing them, because nobody could get a clue why they were written to the binary at some point, but it's doable. So, if you want to help specify, there's work here, and we need that as well. If you want to code, there are a lot of no-timestamps options that need to be added to software; or, if you want to do politics, once that code is written, you can try to convince upstreams that it should be the default. And we are also still missing a script that you give a .changes file, and maybe a record of the environment, in the .changes file or in another file, I don't know, maybe in a buildinfo file, and it would fetch the correct versions of the dependencies from snapshot, set that up in a pbuilder or sbuild chroot, and run the build. That would be an awesome thing to have. We know it's doable; no one has just gotten to it yet. It's probably a weekend project. And there's project management to be done here. There are a lot of baby steps to move towards the goal of reproducible builds. I would love to have someone who is not already swamped with too many things to just ask me, once in a while, and ask anyone interested: hey, what are you going to do about reproducible builds this week? And I can say, yeah, I will look at this package. And maybe I won't, but at least there would be some kind of regular refresher: OK, what do we need to do next?
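A first brick of that missing "weekend project" script could look like the sketch below: given the build date recorded for a package, construct the apt source line that points at snapshot.debian.org for that moment, so the exact package versions can be reinstalled. The date and suite are example values; the `YYYYMMDDTHHMMSSZ` URL form is the one snapshot actually uses.

```shell
# Turn a recorded build timestamp into a snapshot.debian.org apt source.
set -e
build_date="20140830T000000Z"   # would come from the .changes/buildinfo data
suite="sid"

sources_line="deb http://snapshot.debian.org/archive/debian/${build_date}/ ${suite} main"
echo "$sources_line"
# A full script would write this into the pbuilder/sbuild chroot
# configuration, install the recorded package versions, and run the build.
```

Everything after this line generation — driving pbuilder or sbuild, pinning the exact versions, launching dpkg-buildpackage — is the part still waiting for its weekend.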
I mean, during the last hacking time, Asheesh just told me: you know, do something about reproducible builds, and I sent a patch to dh_python that took me half an hour to write. There are a lot of small, small things that can be done. How do you stay in touch with the project? There's the ReproducibleBuilds wiki page, subscribe to it, and also the mailing list. And that mailing list is used, as a usertag in the Debian BTS, to record every bug we've reported so far as part of the project. And we've had very good reactions from a lot of maintainers, who promptly took the patches that added a -n to gzip, or added sorting of files to get a stable order. So yeah, there will be a BoF; that's why I was cutting some of the questions short. There will be a BoF right after dinner to discuss technical solutions, so we'll have 45 more minutes to try to, you know, sort things out and make plans so we can get to total world domination. No, so we get to reproducible builds. So yeah, that's the end of what I had to present so far. I want to give credits: Stéphane Glondu did a lot of work at some point; he's not here. And David Suarez, and a couple of other people who actually really got interested in the deep end of the thing, and didn't just tell me: yes, this is great. And please, I mean, this is not something that I, or even two people, will fix alone. This is an issue for the whole Debian and free software community, even though I started it. So let's, you know, let's do that. Just on that: maybe we can expose these problems to the developer community as a whole via the package tracking system. The new tracker is relatively easy to contribute to if you know Python and Django, so it would be really helpful if someone could add a widget to look at that stuff.
So, one thing we're missing right now is a set of, you know, standards, or at least the beginning of a standard, on what the way to reproduce a build is. What is the environment, where do we store the information? Because right now it's been ad-hoc experiments, and so I can't say: well, yay, bonus point, your package is reproducible, because I don't know what that means yet. We need to decide on something here, or just get to some code that will decide, but there's nothing currently. I would love to have an infrastructure like ci.debian.net that would just, you know, continuously try to reproduce packages. But we're missing a few steps. Any other questions? Is this crazy? Is this a waste of time? Yes and no. Huh. OK, troll me. OK, so this is kind of an insane idea for the future, but you were talking just now about having something like ci.debian.net. It would be cool if we could put that in, like, an AMI or something, so we could have distributed trust, so that anyone who wanted to could, that's an Amazon machine image, so anyone who wanted to could run some sort of continuous integration thing, and if the Debian infrastructure got pwned, then someone else would catch it. Yeah, that's right; we have only one. So we would need reproducible AMIs. Well, that's another project Paul started, which is called reproducible installs; look at the wiki a bit. That was rough, and that'll be our last question. On the comment of being able to reproduce this: given that we're going to talk about having developers upload reproducible builds that we can then verify in the reproducible build environment, that really means we need to be able to document exactly how you set the whole thing up to generate a build of this type, which means you get essentially that for free.
And I think that it's probably better to have the detailed documentation and the specific set of tools than to have it all as a machine image anyway, because that way you can verify each step of it. And given that one of the goals here is security, that verification gets you more trust in the entire process. It also allows for the various variants in workflow, because different people like to use cowbuilder versus LVM partitions, or sbuild instead of pbuilder, or all that kind of thing. Well, I mean, my goal would be: it's in dpkg, and you do dpkg-buildpackage --rebuild, give it a .changes file, boom. And you can do that on any Debian system. That would be the best user experience for reproducing builds that I would love to get. We'll see how far we get. I think we're out of time. So I will thank everybody who came and who's supportive of this thing. Come to the BoF if you want to do stuff. Thank you. Yeah.