So, hi. My name is Christos Zoulas and I've been with NetBSD forever. And for the past year, I've been trying to get NetBSD to work with reproducible builds. This picture is very relevant to me because, although it's a very old picture, my first contact with reproducibility was my homework when I was in elementary school. They had a much more modern version of this, but that's how they gave us our homework, and it looked almost the same every time, so it was kind of consistent. But why do we want reproducible builds? Well, first of all, would you buy a product that's different every time? Consider buying a soft drink: would you buy one that tasted different every time, or depending on whether it was manufactured in the States or in Paris, or whether it was manufactured today or tomorrow? Or depending on which factory built it, or whether it was raining, or hot, or cold while it was built? I don't think so. In science, reproducibility is one of the major cornerstones, and lately we have seen complaints even from within the scientific community that there is a lack of reproducibility: people come up with wild claims, and other laboratories around the world have trouble reproducing them. And this is because we don't have a clear process to go from the ingredients, from the basic premise, up to the result. There are a lot of gaps in the way we describe our processes, and these gaps make irreproducibility possible. So, in software we're doing well with open source, but we're also doing poorly, because we don't have a really good engineering process when we build things.
And, yes, we can sign the result of the build — the packages, the install media — and we can sign the source that the media was built from, which shows that we trust the person who built it and that nothing has been tampered with in the process. But can we verify that this person took this source and produced this binary artifact? Today, for most things, that's impossible, because even if we take the source and build it ourselves, the result won't match. So reproducibility is all about the ability to make sure that the path from the source tree, the repository, down to the distribution media — the CD-ROM, let's say — produces exactly the same results. There are people who say this is a waste of time and we shouldn't be doing it; we have better things to do, and instead of spending time twiddling bits and timestamps, we should really be fixing bugs. All this 100% reproducibility is kind of useless, they say; we should just make sure that certain things are reproducible — for example, the output of the compiler: if you build this particular binary with these particular compiler flags, it's the same every time — but not go all the way from source to binary to CD-ROM. But that makes life really complicated for everybody, because you end up with a CD-ROM that looks like an onion: you have to keep peeling and peeling and taking things apart until you find out why the checksum of that CD-ROM doesn't match the one you just built. And yes, the threat of implants is real, but it becomes less important if you sign things. And bitwise reproducibility is attainable: lots of open source projects, if you look at reproducible-builds.org, have achieved it.
So I was really lucky to start with NetBSD, actually, because it made my life very, very simple; people have done most of the work already. NetBSD has a single source code repository — I don't have to go around the Internet and fetch tarballs to build the distribution. There's an integrated toolchain, and because people have been thinking about this, everything that builds binary artifacts that end up in the final product has been "toolified": there are no external dependencies on the host operating system's tools for anything I need to be reproducible. And cross-building is supported. Basically, that means I can build on a different operating system — that's what the Debian folks at reproducible-builds.org do: they just download our source and run build.sh, and that's all they have to do to get a full NetBSD build — and we can build for different CPU architectures trivially as cross-builds. These days nobody wants to build on the slow CPU they're targeting; they build on the fastest machine they have, so for most architectures the build architecture is different from the target architecture.
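A cross-build invocation along the lines described can be sketched as below. This is shown as a dry run, and the object-directory path is invented for the illustration; `-U` (unprivileged), `-m` (target machine), and `-O` (object directory) are build.sh options, but check BUILDING in the source tree for the authoritative list.

```shell
# Sketch: what a Linux or other non-NetBSD host would run from a
# NetBSD source checkout to cross-build a sparc64 release.
# Printed rather than executed, since it needs a full source tree.
cmd="./build.sh -U -m sparc64 -O ../obj.sparc64 release"
echo "would run: $cmd"
```

build.sh bootstraps its own toolchain first, which is why the host only needs a working C compiler.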
So if you go to this website, you see exactly this — the red stuff is mine. They build things twice, they add a few variations, which I'll get to later, and there are comparisons from a tool called diffoscope, which is basically diff on steroids. This tool is great: it's responsible for actually peeling the onion. You give it a CD-ROM and it takes it apart, layer by layer, until it finds the source of a difference, and then it runs the appropriate tool — for example, if it's an ELF file, it runs readelf and compares the sections — and then shows the hex differences. All of that is on the website, which is very nice, because every week I went there and was disappointed to see my builds were still different, but on the other hand it pinpointed exactly the parts of the build that differed, and it runs automatically. And right now, since the beginning of the year, February–March, we are fully reproducible on both architectures that get built weekly on Debian, sparc64 and x86_64. So Debian varies a few things, as you can see, and some things are not varied: it varies the path, language, environment, time zone, and so on, and it also varies the umask; it doesn't vary the CPU, and it doesn't vary the file system. These are the things that people consider sources of difference that prevent reproducible builds. Well, if you distill them down, there are really ten categories here. There are timestamps — dates and times embedded in the output — and we build things depending on the time zone and embed timestamps that are time-zone dependent. Parallelism is another issue, as is sort order.
Then there's building things with random data inside them; a really nasty one is path normalization; the tools that I mentioned; build parameters; environment values; and finally it would be nice to be able to build as any user and still produce a CD-ROM that is exactly the same. So what timestamp do we use? We decided that the best time to use is the timestamp of the latest commit. We call it MKREPRO_TIMESTAMP for historical reasons; we might change that so it's exactly the same as what Linux uses. And all the file system objects get this timestamp. So to find it, when we start the build, we have to get the timestamp of the latest commit. That's very easy with git and Mercurial; unfortunately we use CVS, and with CVS it's not. I don't want to spend too much time on that slide, but let's put it this way: CVS takes a view of the repository directory by directory and file by file — there's no global view. There is also an issue with updates versus checkouts, because an update gives the file a current timestamp. And that makes sense: let's say you're in a directory building a program — call it foo.c — and you've just built it. Now you run cvs update and there's a newer version of foo.c. You want the new foo.c you just downloaded to have the latest timestamp, newer than what was just built, so that it gets rebuilt the next time you run make. On the other hand, this is not the right timestamp for reproducibility, so I added a -t flag to check out with consistent timestamps; when you update your tree for reproducible builds, you can use that. And finally, I wrote a tool called cvslatest that scans all the metadata in the CVS/Entries files and finds the latest timestamp, and that becomes your build timestamp. Then you have to add timestamp support to everything that embeds timestamps in its output format.
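The cvslatest idea — walk the metadata and keep the newest timestamp — boils down to "take the maximum". Here is a sketch of that idea in plain shell; the real cvslatest parses CVS/Entries records rather than file mtimes, and the file names below are made up:

```shell
# Sketch: derive an MKREPRO_TIMESTAMP-style value by scanning a tree
# for its newest modification time (GNU stat/touch assumed).
set -eu
dir=$(mktemp -d)
# Three files with known, distinct mtimes (a hypothetical checkout).
touch -d '2017-01-01 00:00:00 UTC' "$dir/a.c"
touch -d '2017-03-15 12:00:00 UTC' "$dir/b.c"
touch -d '2017-02-01 00:00:00 UTC' "$dir/Makefile"
# Print each mtime as seconds since the epoch; the largest one wins.
latest=$(find "$dir" -type f -exec stat -c %Y {} + | sort -n | tail -1)
echo "MKREPRO_TIMESTAMP=$latest"
rm -rf "$dir"
```

Every file system object in the release then gets stamped with that single value, which is what makes the whole installed system show one timestamp.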
And the first few are obvious — pax, tar, and makefs; the second one less so, but it uses dynamic UUID generation based on timestamps, so that has to be deterministic too. ar already has -D for deterministic builds, which basically zeroes all the timestamps and the UIDs and GIDs embedded in the archive — which is fine, because people don't use ar as a file transfer format that needs to preserve timestamps anymore, so those timestamps are less than useful anyway. And finally, all the documents that use the roff macro that prints the date the document was formatted need to be changed. For that, experience says just take it out, so I disabled it conditionally in the makefiles when doing reproducible builds. For embedded dates and times, you have to remove these three macros — __DATE__, __TIME__, and __TIMESTAMP__ — from the sources, and that's what I did. Eventually we can put them back and have cpp obey an environment variable, setting those macros from the correct, fixed timestamp. And again, there are file system formats that want to use local time, like ISO images, and for those, again, you have to make it consistent, so you just choose GMT in this case. Going to the next step, we have directory and sort order. Things that scan sets of files and directories to build artifacts need to order them, which was easy to fix: you just sort them. The most problematic one was install-info. When, as part of the build, you use install-info, every package puts its information inside the global dir file, and that ends up being out of order. Since it's complicated to fix every version of the tool out there, I decided to just write a simple text-processing tool to sort the file after each build was done.
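The "just sort them" fix can be shown in a couple of lines: directory-read order is file-system dependent, so anything that walks a tree to build an artifact pipes the list through sort under a fixed locale first. This is an illustration of the idea, not NetBSD's actual code:

```shell
set -eu
dir=$(mktemp -d)
# Hypothetical object files; creation order is deliberately unsorted.
touch "$dir/zeta.o" "$dir/alpha.o" "$dir/Mid.o"
# readdir(3) order varies between file systems and hosts;
# LC_ALL=C sort makes the list byte-wise identical everywhere.
list=$(cd "$dir" && find . -type f | LC_ALL=C sort)
printf '%s\n' "$list"
rm -rf "$dir"
```

Fixing the collation locale matters as much as the sort itself: `LC_ALL=C` puts `Mid.o` before `alpha.o`, while a language locale might not.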
By far the most complicated and painful one was GCC. GCC has many, many different nuanced issues here. The first one is the expansion of __FILE__, and for that you can use -iremap, which is a NetBSD extension. By the way, there are patches for these on the NetBSD website, but there is controversy about how many of them, which ones, and how to do them exactly. So the first one is easy: it just remaps the path. The second one is a little more complicated. There is a -fdebug-prefix-map in GCC right now; the problem is that you can't really put the source path in there, and that's why there are quotes around it — because if you put the source path in, the expanded source path ends up in the DW_AT_producer and DW_AT_comp_dir paths. So, here — oops, sorry — there. You see, if I had put the full source path there, I wouldn't get this symbolic path, I would get the expanded path, and that would be different in every build depending on where I build. So the extension to GCC here was to expand environment variables when it finds a dollar sign, instead of having to put the expanded path on the command line. And finally, the more complicated part comes after this is applied: your tree is now normalized to /usr/src, but depending on whether you're building on NetBSD with object directories or not, your build paths still come out different. So I added another option called -fdebug-regex-map. I guess the GCC people will probably take the first one; they will never take this one, so I don't know what to do about that. But what it does is use a regex capture syntax, like I said, to map paths around so that things are consistent whether you're using object directories or not. The same thing was trivially added to lint, again for lint libraries. And that does it for paths from GCC.
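The effect of these prefix-mapping options can be simulated with sed: whatever the actual build location is, the path that would be embedded in the artifact gets rewritten to a canonical one. This only mimics the mapping semantics — it is not how GCC implements -fdebug-prefix-map or -fdebug-regex-map — and both paths are invented:

```shell
set -eu
# The same source file as seen from two hypothetical build locations.
path1=/home/alice/work/src/lib/libc/gen/getcwd.c
path2=/build/scratch/src/lib/libc/gen/getcwd.c
# Map "anything up to and including /src/" onto canonical /usr/src/ --
# the kind of rule -fdebug-regex-map lets you express as a capture.
canon() { printf '%s\n' "$1" | sed 's|^.*/src/|/usr/src/|'; }
canon "$path1"
canon "$path2"
```

Both invocations print the same canonical path, so debug info produced in either location compares equal.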
Now, unfortunately, we have to deal with symbolic links. It would be fine to limit your build by saying it has to be rooted at a particular directory, and that has to be /usr/src — but that's a limitation. We don't want that; we want people to build without root, in whatever directories they want, and those directories can contain symlinks anywhere. So the problem is that programs, when they start up, can either believe what $PWD says, verify it, or use getcwd(3) to get the current working directory, and typically we want them to be consistent. Take make: if you type "cd otherdir" and then make, then depending on what shell you use, the shell can take the logical path and convert it to physical before it ever reaches make, and that kind of screws things up for us. So the solution is to have make obey the logical working directory: instead of using the shell to change directory and then invoke make, you use make directly and tell it to change to the path. That feature already existed, so you can do that. And we did it in make because make is already a tool, but the shell is not: we use the host operating system's shell, so we can't depend on it, but we build our own make, so we can depend on it to do the right thing. So what is a tool, then? This is the list of all the machine-independent tools that participate in the build — all the programs that produce some kind of output that needs to be consistent, and that need to be guaranteed to be there. Some of them don't exist in certain host toolchains — if you build on, say, some other system, you might not find them. Some might not be installed, and you don't want the build to break, or they might behave differently, particularly on other BSDs. Some of them are used for Kerberos, like compile_et.
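The logical-versus-physical path problem is easy to demonstrate: cd through a symlink and compare what the shell tracks (logical) with what getcwd(3) reports (physical, which is what `pwd -P` shows). The directory names are made up:

```shell
set -eu
top=$(mktemp -d)
mkdir "$top/real"
ln -s real "$top/link"
cd "$top/link"
logical=$(pwd -L)    # what the shell believes: .../link
physical=$(pwd -P)   # what getcwd(3) returns:  .../real
echo "logical:  $logical"
echo "physical: $physical"
cd /; rm -rf "$top"
```

A program that embeds `$logical` and another that embeds `$physical` will disagree, which is exactly why the build drives directory changes through its own make rather than the host shell.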
There are a lot of them out there, but this is the total set of machine-independent tools. Machine-independent means that no matter what the target architecture is, I need those tools, and they have to produce output that is independent of the build architecture. Then we have the machine-dependent tools, which are basically your standard toolchain: assemblers, linkers, et cetera. You will notice that make is both machine-independent and machine-dependent. That's because make knows about your current architecture: this make is the shell-script wrapper that is architecture-specific, while the other one is architecture-neutral. The other thing that's very complicated, as you well know, is that every package has different build options: you can build with different backend libraries, different defines, different features, this and that. The NetBSD process avoids all of this by providing build defaults: every program that is packaged in NetBSD has its own defaults. Unfortunately, we don't yet provide full isolation, so if you have an mk.conf with different values and you don't specify that it shouldn't be used, it will override some of the parameters and produce different builds — but that's simple enough to fix; we just haven't done it yet. This is the set of parameters that take values, not just booleans, and these are the tunables that we have in NetBSD. You see, there are just too many of them: if you wanted a build for every single combination of them, you would have, I don't know, what is it, 2 to the 12 times 7 or so combinations of builds. That is unwieldy; some of them we should just get rid of. As for the last part — isolating the build environment to get a reproducible build — you can go for the extreme case where you totally control the environment, which means I'm only going to be able to build reproducibly if I have a VM.
I know the exact VM parameters, and then I can only build in that VM, and that's the only way I can make a reproducible build. That's the easy way out, but it's also the least satisfactory one. You have to go both ways: fix at the source the things that are easy to fix, and only depend on the environment for the things where that actually has value. So, for example, everything we did to sanitize the source and the build system is great, but building as a non-root user is harder. Fortunately, NetBSD has already done that, so you can just pass a flag to build.sh to build unprivileged. And the way this works is that we teach all the programs that produce artifacts containing user information — like pax, makefs, and install — to take a specification and produce them from it. That specification is handled with a program called mtree, which is common to all the BSDs — we actually recently synchronized it between FreeBSD and NetBSD, so it's almost identical. And we make sure that the only thing that actually installs binary artifacts, be they directories or files, is install, and that you only use install to install those files. Then, by specifying this flag, the install program, instead of setting the permissions on the destination files, just appends them to a metadata log. So when you build tar files or make a file system image, you can tell the tool to consult the metadata log for the permissions of those paths and put them inside the binary artifact with the correct user. So you can actually build without being root and end up with the same results. NetBSD-current and NetBSD 8 are already built with the reproducibility flags.
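The unprivileged-build trick can be sketched as follows: instead of install chown'ing the destination file (which needs root), the intended owner and mode are recorded as an mtree-style line in a metadata log, and the image-building tools apply them later. The log line below mimics NetBSD's METALOG format but is a hypothetical entry, and plain echo stands in for the real install:

```shell
set -eu
destdir=$(mktemp -d)
metalog="$destdir/METALOG"
# "Install" a file as an ordinary user: copy it into the destination
# tree, then record the ownership/mode we *wanted* in the metadata
# log instead of chowning (which would require root).
mkdir -p "$destdir/bin"
echo 'fake binary' > "$destdir/bin/ls"
echo './bin/ls type=file uname=root gname=wheel mode=0555' >> "$metalog"
# makefs/pax would later read METALOG and write root:wheel 0555 into
# the image, so no privileges are needed at build time.
log=$(cat "$metalog")
echo "$log"
rm -rf "$destdir"
```

The key design point is that ownership becomes data, not file-system state, so two builders with different UIDs still produce the same image.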
You just need to say build.sh -P, which means that if you go to the release engineering website and download the binary artifacts of a particular build, at a particular source date, for a particular architecture, you should be able to run the same build on the source you just downloaded, and the binaries you build should be identical to the ones you downloaded from the build server. Reproducibility is not checked automatically yet, though: we are not doing two different builds with variations, like Debian does, to make sure we haven't broken reproducibility. We should absolutely upstream the patches. And, as I mentioned, we removed a lot of date reporting — in boot code, for example — for expediency, to get reproducible builds; we could put that back and just make it consistent. So, as you can see here — this is one of my machines — the shell has the right timestamp when you list it. If you stat ps, again, it has the same source date, and if you run uname, you can see that, again, that was the timestamp the kernel was built with. It's pretty neat to see the whole file system have one timestamp. There are other bugs, and we need to add more build variations to find them. One of the very important ones is that we sometimes have uninitialized memory, and setting it to different values between builds can reveal more sources of difference and more bugs that we have to fix. Junk-filling allocations was, again, a big help: if you're storing pointers that you malloc'ed during the build, these change from run to run, and instead of always being the same, you can see that you're actually storing stale pointers that make no sense, as opposed to zeros, in your data structures. In this particular case, we were storing stuff inside the superblock of an FFS image in makefs.
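Checking the claim then boils down to comparing digests: build the same artifact twice and verify the checksums match. Here is a toy stand-in — a deterministic "build" that sorts its inputs, run on the same input list in two different orders; the file names are invented:

```shell
set -eu
mkartifact() {
    # Deterministic "build": concatenate the inputs in sorted order,
    # so the result is independent of the order they were given in.
    out=$1; shift
    printf '%s\n' "$@" | LC_ALL=C sort > "$out"
}
d=$(mktemp -d)
mkartifact "$d/build1.img" lib/libc.so bin/ls sbin/init
mkartifact "$d/build2.img" sbin/init lib/libc.so bin/ls
sum1=$(sha256sum "$d/build1.img" | cut -d' ' -f1)
sum2=$(sha256sum "$d/build2.img" | cut -d' ' -f1)
[ "$sum1" = "$sum2" ] && echo "reproducible"
rm -rf "$d"
```

Verifying a real release works the same way, just with the downloaded release sets on one side and your own build on the other.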
Then there's other randomness in tools that we haven't fixed. For example, there is a bug in GCC on some of the RISC machines when you build with profiling: it uses a function ID number to build labels, and that number changes depending on the order in which some functions are optimized, so you have to turn the optimizer off to produce consistent results. Apparently there is some randomness in sorting or processing the functions. Finally, I would like to thank the NetBSD Foundation for having all of this stuff almost ready to go; the Debian people, for giving us the infrastructure and being a very strong force pushing everyone toward making most open source reproducible, and for supporting great tools such as diffoscope; and the individuals who actually worked on reproducible builds, like York and Thomas, and Luke and Todd, who worked on build.sh. So, thank you very much. Questions?

[Audience] I don't really have a question, but I have something to say that you might also find interesting, and if you find a solution for it, that would be awesome. I also do some reproducible builds, and I know that if I build on Debian I get different binary artifacts, because inside clang there is an optimisation that evaluates constant calls at compile time — for example, a call to the sine function with a constant argument, which is often replaced with the computed value — but the value that gets built into the assembly depends on the math library you have on the build system, so the glibc math library results differ from the BSDs', and the last couple of bits come out different.

Yeah. That's a tough one to fix. I mean, these toolchain bugs are hard to fix. There are different ways of fixing them. Yeah.
[Audience] I also think that, depending on what it's built with, it uses a different hash table implementation internally, which causes certain instructions in the assembly to be placed in a different order.

Yeah. Yeah, this is the most frustrating part of the process: getting to the last mile and finally producing two identical results. Some of these are really nice to fix, but some don't depend on you, and you have to have a lot of buy-in from upstream to tell them: look, guys, we want reproducible builds, so make it reproducible. Anybody else? Thank you very much.

[Audience] There's one more thing. There have been efforts elsewhere at finding bugs related to the iteration order of data structures — for example, adding new options for deterministic iteration of certain data structures, or randomized iteration of all data structures, to flush out these kinds of bugs. And about the constant folding: just don't do that — don't depend on constant folding for those functions; it's fine for operations where the result must be exact, but not for these.

Thank you. All right, thank you very much.

[Audience] It's a comment, not a question: the GNU Guix project says hi, and we'd like you to come to the Reproducible Builds Summit coming up.

Okay, thank you very much. See you in November.