Okay, hi, my name is Holger Levsen. I'll talk about reproducible builds: the status, and where we actually want to be. So this is some blah about me. More important is that these are all the people in Debian who worked on this, and I'm just one of them. And there are many more people outside of Debian also working on this. First, I'd like to know something about you. Who has seen a talk about reproducible builds already? Okay, so half the audience or something. Who has contributed to these efforts? Okay, some people. And who has used reproducible builds as a user? In other words, who has reproduced something which they were using? So very few people, but a few. So, about the motivation for why to do this. Free software is great. We can modify it, share it, use it, pass it on. But that's all about source code, and we use binaries. And we need to believe that the binaries are coming from the source, because there's no way to really be sure. You cannot prove it. And I don't want to believe that. I want to be sure, I want to know that it's true. I'll only explain the problem very briefly here; this talk from three years ago by Mike Perry and Seth Schoen explains in great detail why reproducible builds are useful. I just have a few examples. So this CVE-2002-0083 was a remote exploit in sshd, and the difference was one bit in the binary. The mistake was a greater-than comparison which should have been greater-or-equal, and the difference is one bit out of 500 kilobytes. So if you just look at the bits, you will not see it. And they also had a live demo with a kernel module which modified the kernel in memory, but not on disk. So if you inspect the code, the code looks correct, but if you compile the code, then it will compile something else.
And also it's really hard to protect computers which are connected to the internet all the time. Especially if somebody has physical access, they can modify stuff in memory and you cannot really protect yourself well. And how much do you pay your admin? So one of the easiest ways is probably just to bribe somebody and subvert the system that way. And there are also legal challenges. There could be a legal requirement where the state says you have to put this backdoor into the binary, or you are not allowed to do business here. And as said, this is explained very much better in this other talk. And there's also this white paper from the CIA conference where the CIA described how they would theoretically backdoor an SDK to compromise the code which is built with this SDK. 2015 was when this paper was discovered. And in, I don't remember, 2014 or 2015, there was XcodeGhost, where somebody backdoored an SDK for iOS and put it on servers which were faster to reach from China. So many Chinese developers downloaded that trojaned SDK, and then there were 20 or 30 million compromised applications in the wild. And that was with good source code. And so our solution is that anyone can always and independently generate bit-by-bit identical binaries from a given source. That is what reproducible builds is about: bit-by-bit identical. I used to have a demo where we built a Debian package five times, and a year ago you would get a different checksum five times. If you build it now, you get the same checksum five times. And that is really it. And we also say we include everything the build produces, also documentation and data files. All of it should be reproducible, because we just want to look at the results and not say we exclude these parts or these bits because they are not important. We just say everything matters, everything should be identical.
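To make the "build five times, get the same checksum five times" idea concrete, here is a minimal sketch in Python. It is not the demo from the talk; the function names are made up for illustration, and the build outputs are stand-in byte strings rather than real Debian packages.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of one build artifact."""
    return hashlib.sha256(data).hexdigest()


def is_reproducible(artifacts: list[bytes]) -> bool:
    """True if all build outputs are bit-by-bit identical.

    Everything the build produced is hashed as a whole; no part of
    the artifact is excluded as "unimportant".
    """
    digests = {sha256_of(artifact) for artifact in artifacts}
    return len(digests) == 1


# Stand-in for five builds of the same source (hypothetical data).
good_builds = [b"\x7fELF example binary"] * 5

# The same artifact with a single bit flipped in the first byte.
tampered = bytes([good_builds[0][0] ^ 0x01]) + good_builds[0][1:]

print(is_reproducible(good_builds))               # all five digests agree
print(is_reproducible(good_builds + [tampered]))  # one bit differs, so they do not
```

A single flipped bit, like the one-bit difference in the sshd example, is invisible when you look at the binary by eye but changes the checksum completely, which is why comparing hashes is enough to answer "is it reproducible or not".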
And this also works with RPM packages by now. So RPM has been fixed as well. Signed RPMs are a bit more complicated, because you build the RPM, then you create a signature for it and put the signature into the RPM. But if you want to rebuild that, you just take the same signature and put it into the rebuilt RPM, and you get the same RPM. The signature will match because the data does match. And we think this should become the norm. So we really want to change the meaning of free software, so that it's only free software if it's reproducible. It's like a quality norm. It's all still software and still free software, but it's crappy free software if it's not reproducible. And surely it's just one link in the chain for secure software. There's the whole software lifecycle management, where you put the code, what code you write. All this stuff also matters if you want to write secure software. But it's a critical link, because it links the sources to the binaries and vice versa. And the problem with randomness is that you can never be sure. This old XKCD joke. The problem with reproducible builds is a bit different: it's a lot of effort to prove only a very small part of this secure software chain. But it's still a critical part, because all the other effort which you put into the source code is worthless if you can never be sure that the binaries you're running are really coming from the source. Yeah. And there are more benefits than security. With our testing, we found lots of subtle QA bugs where the software built differently: different locale, timing issues, whatever. We discover lots of strange errors. Google does reproducible builds to save time and money. They have everything in one big source repository and it just builds faster, because most results can be cached. The data is also smaller, so smaller updates are possible. I think Fedora does this. And there's also a side effect.
There's a meaningful diff between two different source code versions. So if you only change one area of the source code, all the other parts should stay the same, and you can diff that better. Yeah. And to start with the history a bit: in 2011, Bitcoin were the first to do reproducible builds. At that time, Bitcoin was worth four billion. I think Bitcoin now has a market capitalization of 1,000 billion. And they wanted to be sure that nobody can distribute binaries saying this is from the Bitcoin developers, and then there's a backdoor in it that takes all the bitcoins away. So they wanted to assure the users of Bitcoin that their client is reproducible. Then Tor did the same with their browser in 2013. In 2013, Debian also started, but it really only got going in 2014. And this year we managed to get it into Debian policy. So Debian policy now says packages should be reproducible. And in 2014 there was also the Core Infrastructure Initiative, which at the moment pays my bills, or rather I bill them and they pay my bills. And other projects got involved. So FreeBSD, coreboot, LEDE, openSUSE and NetBSD all got involved as of last year. And Tails, just last week or the week before, made their first Tails ISO image which is reproducible. And then this year we also learned that Cygnus in 1992 released the GNU toolchain for nine architectures in a bit-by-bit reproducible way. But everybody forgot about it. Like, we have worked on this since 2013 and only discovered it this year. So what I hope we've achieved by now is that nobody will forget about reproducible builds anymore. I'm not sure, but maybe, hopefully. And so this is the progress in stretch. Green are the reproducible packages, orange the unreproducible ones, and red the ones failing to build. So we have 94% reproducible and we got it into Debian policy. Yay. And I now call this a somewhat misleading success, because it's still a long way to go. We are lacking the infrastructure.
I will explain that in detail in a moment. And it will probably take until 2021 until Debian policy says packages must be reproducible, and then we'll get close to 100%. And 6% is still a lot if you're talking about 25,000 source packages. That's still, what is that, 3,000 source packages or something, no, less, it's 2,700. But anyway, there are still many packages. And the Debian developer community really supports it, but there are still some hard corner cases. Like, we cannot say packages must be reproducible and then delay the next release for five years because there are 10 packages which are not reproducible. So that's still difficult. And also, I hope I'm wrong, but I only see two other big or relevant projects with similar commitment: that's Tails and Tor. But for them, a small how-to is sufficient. Like, if you want to reproduce the Tor Browser, you build it this way and you get this binary at the end. If you rebuild Tails, you build it that way and get that binary. But for 25,000 packages, it's way more complicated. You need infrastructure and stuff. And really, this commitment of the projects is also the other part, which I'll explain later. So we are at 94%, theoretically being able to do reproducible builds. But we are lacking infrastructure to distribute all the hashes. Like 20,000 or 25,000 hashes multiplied by 10 architectures, and users need to reproduce them, need to have tools, and this is all missing. And Debian is the most advanced distro here. The others haven't even started. So if you think reproducible builds will be there soon: yes, maybe, if other communities do the same. And so we need to keep doing what we've been doing, and we need to do more things, and we need more people and more communities to join. And yeah, we did the first 90% in 90% of the time, and now we need another 90% of the time for the last 10%.
So what we've done: now we have this webpage, reproducible-builds.org, which has how-tos; there's a mailing list, IRC channels, and we have common problems documented on this webpage. We wrote diffoscope. diffoscope examines differences in depth, recursively. You give it two objects to compare, so it will take two deb packages and find the tar archive in there; in the tar archive there are many files, then there's a PDF in there which has an image in it, and it goes on recursively and shows the differences in the smallest object it can find. It does HTML or plain text output. It's available now in every major distribution. It's also on PyPI, it works on BSD, and it's really, really cool. If you haven't looked at it, give it a try. You can also just go to try.diffoscope.org and upload two objects; it can be two RPMs, two ISOs, two text files, two anything, and the result in HTML will roughly look like this, where you see exactly where the difference is between the two things. But diffoscope is just for debugging, for finding out why something is unreproducible. If you just want to know whether it's reproducible or not, then you just compare the hashes and that's it. And we built this tests.reproducible-builds.org, which is mostly testing Debian, three releases, even also testing stable. We're doing this on four architectures: amd64, i386, arm64 and armhf. Those are the sponsors of the hardware. But we're also testing LEDE, NetBSD and FreeBSD. We did test Arch Linux and Fedora, but the tests bit-rotted because nobody was looking at the results, and then we basically stopped doing them. And there are 40 people working on the setup. The LEDE tests are well maintained, the NetBSD and FreeBSD tests are doing nicely, and we apply variations there when we test. That is one thing. So the variations we apply: we do the first build with these settings and the second build with the others, so we vary the time zone by more than a day.
We vary the locale, the user ID, the file system, the CPU type if we can, to get a maximum of variations, so that we can test what will happen in the wild. Because in the wild anybody will rebuild, and they have strange hardware, so we try to apply as much variation as we can. We think there will be more variation in the wild, but we hope to catch most of it. The common problems we found are timestamps, timestamps, timestamps and timestamps, and time zones. Really a lot of time zones, too. If you unzip a zip archive from 1980, your local time zone will be applied. So if you want to do this in your code, you first need to normalize the time zone and then unzip it. Same with locales. The build path is embedded, and there are lots of small issues which affect only five or 10 packages. Lunar gave a talk at the CCC camp in 2015 where he gave many examples of how to avoid that, so we call gzip with -n, and other common things you need to do. And we came up with SOURCE_DATE_EPOCH, which is defined as the last modification of the source code, because sometimes it is useful to include a timestamp, just not a meaningless timestamp like the build time, but rather the source code modification time, because that doesn't change; that is deterministic and meaningful. In Debian we define it from the last Debian changelog entry; in RPM it could be spec files, whatever. And SOURCE_DATE_EPOCH has now been adopted by dpkg, by RPM, GCC supports it, lots of tools. I think it's 40 tools or more which support SOURCE_DATE_EPOCH and will replace the current date with SOURCE_DATE_EPOCH if you build software with it.
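As a rough illustration of the SOURCE_DATE_EPOCH idea, here is a minimal Python sketch of how a build tool could honour the variable and keep the build time out of a gzip header. The helper names are invented for this example and this is not code from any of the tools mentioned; the only assumption taken from the specification is that SOURCE_DATE_EPOCH holds an integer number of seconds since the Unix epoch.

```python
import gzip
import os
import time


def build_timestamp() -> int:
    """Timestamp to embed in build output.

    Uses SOURCE_DATE_EPOCH when the environment provides it, and only
    falls back to the (non-deterministic) current time otherwise.
    """
    value = os.environ.get("SOURCE_DATE_EPOCH")
    if value is not None:
        return int(value)
    return int(time.time())


def deterministic_gzip(data: bytes, timestamp: int) -> bytes:
    """Compress data with a pinned header timestamp.

    gzip.compress() stores no file name, and mtime fixes the timestamp
    field, so the same input and timestamp give the same bytes
    (for a given Python and zlib version).
    """
    return gzip.compress(data, mtime=timestamp)


if __name__ == "__main__":
    # Hypothetical value, e.g. taken from the last changelog entry.
    os.environ["SOURCE_DATE_EPOCH"] = "1500000000"
    first = deterministic_gzip(b"documentation\n", build_timestamp())
    time.sleep(1)  # a later rebuild still sees the same timestamp
    second = deterministic_gzip(b"documentation\n", build_timestamp())
    print(first == second)
```

Calling gzip with -n goes one step further and leaves the original name and timestamp out entirely; passing `mtime=0` here would be the closer equivalent if no timestamp is wanted at all.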
And we wrote two more tools. strip-nondeterminism removes some known useless timestamps from PNGs and other stuff to normalize it. And reprotest: this is a tool which does what this Jenkins test setup does, but on your local machine. You use reprotest to build something locally, it will apply variations, and then hopefully the result will be the same. reprotest now also has a mode where you can say use these variations, and then reduce the set of variations to find the variation which is causing the unreproducibility. So it will do several builds with different variations, and you can see: okay, if I vary this, then my build is different, so I need to look at the code which is causing this. So please do give reprotest a try, especially if you're not using Debian. We really want it to work everywhere: on BSD, on other Linuxes, on macOS. Please give it a try. So, the Debian status. Let's first start with Golang, because that's shorter, and this is a Golang conference after all. Golang binaries are bit-by-bit reproducible, which is yay, but not when the build path is varied. That is quite a common problem which is also very easy to fix: you just rebuild in the same directory and then you get the same result. But Michael Stapelberg also wrote a patch for Golang, which is this one, which is in Debian main, where you can vary the build path and the result will still be the same. And for this we came up with a second specification, BUILD_PATH_PREFIX_MAP. This specification describes an environment variable for build tools to exchange information about the build-time file system layout, to generate reproducible output where all embedded paths are independent of the layout. That is nice in theory. Our biggest problem at the moment is that we have a patch for GCC, because GCC also embeds the path, and the GCC maintainers are not happy with the implementation, and we are discussing it with them. And the problem with this build path
thing is also that the workaround is so simple: just rebuild in a deterministic path. But we want to enable users to rebuild, and I also think: why should the paths be embedded in the binary at all? They should not be there. It can also have privacy implications: if you build in something like /home/projects/blah-secret-project, you don't want to have that leaked into the binary. And so when we test Debian unstable we vary the build path, so we have worse results in Debian unstable, while when we build Debian testing we don't vary it. Because if we want to have a reproducible Debian now or in two years, we will just say: build in a deterministic path, /tmp, source package name, minus version, or something. So this is Debian unstable, and at this point we introduced the build path variation, and there we went from 90 to 70% reproducibility. But now we have caught up: we are now at 86% again, while in February this year we were at 78%. So we fixed lots of things already. One other thing we have for Debian: you can just go to this URL and see the package status, so for all 25,000 packages you will see the status here. We also have 49 package sets, like build-essential, the base packages, all KDE packages, all Java packages or whatever, in case you just want to look at a small part of the archive. And we have some notes: it's a simple YAML repository where we take notes about certain issue classes and the packages affected by them. It's over 6,000 notes now. At the moment it's Debian only, but we want to do this cross-distro, because many issues are the same in different distributions. And one other thing we came up with, which is central to our concept, are these buildinfo files. Buildinfo files describe the sources, the checksum of the sources, all the dependencies needed, the environment to recreate it, and the result. And so the idea is that a user can take a buildinfo file and has all the information needed to recreate the binaries from the sources. And this part we have defined
and working. What we're lacking is the infrastructure to distribute these buildinfo files. And for other distros, non-Debian, this is not as clearly defined, because in our tests we always just rebuild at the same time, so the build environment is basically the same. Using a buildinfo file you can recreate the same build environment. Yeah. We've also filed over 2,000 bugs about reproducibility issues in Debian. I don't know how many of them went upstream; I hope one third, but I lack the exact numbers. So as I said in the beginning, this is, oh, and this is also just the proof of concept for the stretch case. All the changes are in stretch, the source code of stretch is 94% reproducible, but because of the way Debian releases, we don't do full archive rebuilds, so only maybe 20% of the stretch binaries are reproducible. That's really a Debian problem; that is nothing to worry about for you if you're not into Debian. And the other problem we have is that we don't distribute these buildinfo files yet; they are only accessible to Debian developers. So this is what I said: we are there theoretically, but not in practice. But in practice other parties, say Canonical, could take stretch or unstable, rebuild it and release an Ubuntu which would be 94% reproducible if they rebuild everything. Debian 10, buster, the next release, will be partly reproducible in 2019, which is still some time away. And I said about policy that packages should be reproducible. We hope that for the release after the next one, bullseye it's called, in 2021, we'll have Debian policy say packages must be reproducible. And even then, as an example, if LibreOffice is not reproducible, I guess we will release with an unreproducible LibreOffice, because we need it. And it will not be LibreOffice but rather some other important packages. There are 200 key packages which are unreproducible, and there's still work to do there. So yeah, by now it's pretty obvious that there are many people from Debian
in this project, but we care about free software in general. So we write weekly reports, every week a blog post; we are at number 130 now. We held two summits so far, where people from 25 projects meet and discuss for three days and do brainstorming and roadmaps. We'll have another one in two weeks in Berlin. If you want to join, please talk to me; it's from Tuesday to Thursday. And we do Google Summer of Code and Outreachy projects, where we usually mentor people. Now the status of the non-Debian world. I will skip the BSDs, Arch Linux, F-Droid, which is also interesting, and LEDE; I will not mention them much more. The funny thing is NetBSD and FreeBSD: FreeBSD was at 99% reproducible for their base system for a long time, and then NetBSD reached 100 percent first. That was really funny. But that's only for the base system, not the ports system. There are other projects: Google Bazel is a build tool from Google which aims at reproducible builds. There's Ducible, a tool for Windows, so you can do reproducible builds on Windows. A small detour: commercial reproducible software. We have medical devices in our bodies, arms, we have nuclear power plants; they all run crappy software and nobody knows what's in there. But for gambling machines, the states of Germany and France demand reproducible builds, value at a tax. Anyway, Bernhard Wiedemann started with reproducible SUSE in 2016, and these are his results from the 1st of October. He didn't give a percentage, but that's also 93.7 or something percent of the SUSE packages that are reproducible. And these are his main sources of nondeterminism: Javadoc, LaTeX, Mono and Qt, so all documentation basically. We haven't included his SUSE results into the website yet, but we want to do this so that it's easier to compare. And Bernhard also created this Git repo where he's actively sending patches upstream. So he's actually been looking at many Debian packages where the Debian maintainers didn't send the patches upstream; he sent them upstream, and we have joined him there now. And so in RPM in
general, SOURCE_DATE_EPOCH is respected, yum and DNF can be used to recreate environments, there's diffoscope, and the signed RPM thing is also solved. So the technological foundations are there, and Bernhard is there. Bernhard is really doing an awesome job in SUSE. The problem is that Bernhard is only one of the few people in the SUSE and Fedora world who's working on reproducible builds, so please help Bernhard. And so there's no, or no wide, community commitment to that, or management commitment. There are no buildinfo files, no tools to use them, and of course there's no user tooling yet. And this is not limited to the RPM world; actually that is the case almost everywhere. Debian also has no user tooling. Debian has community commitment but no user tooling; that's what we're waiting for. So far we've mostly worked on making reproducible builds possible, but we need to do constant tests in the future, because every new release can introduce new unreproducibilities, so we need to constantly test for that and find that. And we need tools, infrastructure and policies for this to become meaningful and used in practice, so that users can really verify that what we're saying is true. So we want to distribute these buildinfo files, and we want to enable people to do rebuilds, and for that they need the checksums, so we need to distribute the checksums. And there's not really much work done on it. It's really different for different projects, because we have different distribution mechanisms, and that will change. And then we don't really know who should sign these buildinfo files: individual developers, or do we want to have big rebuilders like the CCC, NASA or the NSA, Deutsche Bank, the Russian Army, and you can pick whom you trust when they rebuild it? This is all not sorted out. Somebody needs to do something on this. Maybe I will really go to the CCC and ask them: hey, can you set up a machine and rebuild Debian? I want it to be done outside of the project. And then we need user tools: do you really want to install this unreproducible software, and
do you want to rebuild it before you install it? Nobody has done that. Maybe you can rebuild it and then, if it matches, install it. And how many checksums do you need to call a package reproducible for you? And what do you do if one doesn't match? Maybe the Russian Army or the NSA wants to subvert you; what do you do then? If you want to get involved as a software developer: stop using build dates, please use SOURCE_DATE_EPOCH. Attend the summit in Berlin. If you want, form your own reproducible builds team. It's really fun, and you learn a lot of things because you look at a lot of software. And the best way to get started is to just build something twice, look at the results with diffoscope and then try to fix it. Yeah. Is there still time for questions? These are the resources: we have two IRC channels, actually, #reproducible-builds and #debian-reproducible; you can go to either of these. We have mailing lists, also general ones, and there are several others. We have a Twitter feed, and these are the two talks which are also really recommended to watch. You have questions?
You talked about the lack of user tooling. How do you see that? Is that something which should be integrated into the existing package managers, or separate clients, other software? What do you think is missing there?
We have for Debian this "do you really want to install this"; we have a patch for this for APT, but we lack patches for DNF and for other things. Also for the BSDs it's unclear how to do that. LEDE is reproducible, coreboot is reproducible, but all the user tools to just simply do that don't exist, so it's all theoretical at the moment. It's great: three years ago nobody believed it, or many people didn't believe it was possible; now many people think it's possible. But to really verify it you need to do a lot of manual steps. Not many manual steps, but some, and if you have thousands of packages installed on a machine, then you need to do many manual steps.
Quick question: are the checksums per package or per artifact within
the package? So will different files in the package have their own individual checksums, or is there one single checksum for the whole package?
In the Debian case it's one checksum per binary package.
Okay. Is there any consideration of checksums per file in the package?
No, because we really want the whole thing. The package can consist of ten thousand files, and we don't want to say these files don't matter and these do matter, because then you need to evaluate what matters and why. We just say the whole thing needs to match.
My reasoning for that was that if there were checksumming per file, so let's say the binary had one checksum and the documentation files had their own checksums, then maybe you could share them between distros and sort of consolidate the effort, so that not every distro has to rebuild their own tooling and figure out the checksum distribution and so on.
The binaries will differ between the different distributions because there are different libraries used, so that doesn't work.
Just a thought.
If you want to talk to me later, that's difficult, because sadly I will leave the conference; I'm here for one more hour. So either ask me now or use email or IRC.
Just curious: have you seen the kernel version matter?
Very rarely. It should not happen, but it sometimes does. For example, if there are new kernel features, then of course it happens. Sometimes software just embeds the kernel version: they write a log and then include the log in the artifact, and then the kernel version is in there.
Have you reached out to the Yocto project? Because they try to isolate their build environments for every package that they build, and I think they might be interested in getting reproducibility into that.
About the kernel thing: we often find problems where it does matter, and then it's a bug. Like when there's build-time optimization of the code, that's usually a bug, because you want to have runtime detection of the CPU features.
Ok, thank you.