Okay. Hello everybody. Hello. My name is Joshua Lock. You made my build non-deterministic. Prepare to sigh. I'm a staff open source engineer at VMware, and this talk is an expansion of a Twitter thread I wrote in response to the question of how many of the author's readers had reproducible builds. My thread started with the image on the screen. I've worked on build systems and their security for over a decade at this point, with around 10 years of that as a core contributor to the Yocto Project and OpenEmbedded, and the past three years also working on The Update Framework and SLSA projects. I wanted to share my experience and discuss the three main interpretations of the term reproducible that I've encountered when talking to people about build reproducibility and build outputs. And with me today is my colleague.

Hello. I'm Rose Judge. I'm a senior open source engineer at VMware focused on a combination of software bill of materials, containers, and supply chain security. Joshua and I are here today hoping this talk helps you think about what outcomes you're trying to achieve in the context of reproducible builds, and whether your stack supports achieving those with a reasonable amount of effort.

There we go. So as Joshua somewhat alluded to, reproducible builds aren't new, right? Efforts around reproducible builds have existed for at least a decade now. But the popularity of, and focus on, reproducible builds certainly feels like it has increased in the wake of a surge in supply chain attacks. We can even see this increased focus right here at the conference, in the fact that there are two talks on reproducible builds alone. What this ultimately comes down to is a heightened interest in how we can use reproducible builds to defend ourselves against supply chain compromise. And as it turns out, having traceability to source during your software compilation process is one of the best ways you can do this.
But the benefits that reproducible builds can deliver stretch beyond just supply chain security. Their rise in popularity is as simple as the fact that they are a desirable part of the good engineering practice that we as developers are striving for. For example, reproducible builds require caching of intermediate build artifacts, which makes it easier to reproduce, trace, and debug future issues. This in turn requires a thoughtful method for storing those build artifacts, which leads to things like storage optimization and improved content availability. And ultimately, implementing reproducible builds encourages a certain level of overall thoughtfulness and consideration when developing and delivering software.

So yes, reproducible builds aid in supply chain security and engineering best practices, but why else are they important? Well, they're important because they reduce the need to trust build services, which, as we've seen with the SolarWinds attack, can be a real threat. Whether that build service is a dedicated production machine, a CI instance, or your laptop, you no longer have to implicitly trust that it is not compromised and not injecting malicious code into your build. If your software is bit-for-bit reproducible, you can verify that no backdoor injections have taken place simply by rerunning the build in a different build environment and comparing the results. And this gives you flexibility in which build systems you can use.

Reproducible builds can also provide assurances around what software has been shipped and will be shipped in the future. If you know that your build process can be 100% bit-for-bit reproduced when given the same set of build inputs, you can trace any release, past or present, back to source.
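The "rerun the build elsewhere and compare the results" check described above can be sketched in a few lines of Python. This is a minimal illustration, not something shown in the talk: the function names `sha256_of` and `builds_match` are hypothetical, and it assumes each build produces a single artifact file.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def builds_match(artifact_a: Path, artifact_b: Path) -> bool:
    """True when two independently built artifacts are bit-for-bit identical.

    Hypothetical helper: each path is assumed to be the output of the same
    build run in a different environment.
    """
    return sha256_of(artifact_a) == sha256_of(artifact_b)
```

A mismatch does not say which build is wrong, only that the two outputs differ, so in practice the comparison is most useful when more than one independent rebuild is available.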
And this is helpful for things like recreating and debugging customer issues when they arise in a release, even a release from two years ago, or determining whether a developer's build system has been compromised, or recovering when a binary deliverable is somehow lost. If your build is reproducible, it's a low-effort task to rebuild it exactly as it was.

And really, all of these benefits lend themselves to the general theme of reliability for your software. It's reliable for developers because the software they're developing is easy to share between different development and production environments, it's easy to ensure that build artifacts have not been tampered with during the release process, and there's a reliable caching mechanism in place to record the ingredients for these builds. From a customer standpoint, it's reliable because they know the software they're consuming is secure. They know that checks for tampering have taken place, and they know that, should they need an older version for anything, it can be recreated without much effort.

And when I say all this out loud, it sounds obvious. Like, great, I'll run the same steps with my source code, I'll have a reproducible build, and I get all these benefits. But what do we really mean when we say reproducible?

Yeah, that's a great question. I think most people intend the same thing when they say they want a build to be reproducible. But the reality is that modern software is complex. So we might say that when I run my build again and get the same output, that's a reproducible build, but maybe we don't define "same output" the same way. And that asterisk here, the ambiguity of that statement, is really the space in which the talk we're giving today exists. The reality of our deployments doesn't always meet our expectations, our intent. Differences are often a matter of the software stack we're using, or maybe even the experience of the engineers involved.
So when we talk about what it means to be reproducible, it can be that I invoke a build and have the same sequence of commands execute; that's one potential interpretation. Or that I can do that and, regardless of where or when the build happens, the same sequence of commands is executed and an equivalent output is produced. Or maybe I can do that and get an artifact with the same cryptographic digest, or as Rose said earlier, a bit-for-bit identical output. So during this talk, we'll go through the three interpretations that we just loosely stated, which you can think of as levels of reproducibility. And each of those levels has its own supply chain security implications.

So I mentioned intention versus practice, and there are several reasons why reproducibility is not achieved despite the intention. We're going to uncover these a bit more as we continue to talk, but at a high level, there are three main reasons. The first is build system behavior. Most build systems that we as software engineers use today were developed in more innocent times, or perhaps with different priorities: a build system that prioritizes enabling new users to adopt an ecosystem quickly, for example. And these build systems capture state, often with the intent of adding debuggability, or they just unintentionally capture some kind of state that has implications for the reproducibility of the build.

Another factor is that modern software systems are extremely complicated, and they tend to be composed of multiple components in multiple layers. Unless you're bootstrapping that entire thing, and the tools that you use to produce it, from source, which is clearly a non-trivial effort, the layering and the interaction of these different ecosystems can work against you when it comes to achieving reproducibility.

And then the final factor is just cost. Achieving reproducible builds today takes engineering effort. Achieving reproducible builds requires some long-term storage.
And for many organizations, that cost is either unattainable or not easy to prioritize. And fundamentally, I think, purely or strictly reproducible builds are just not something that most engineers and technologists think about. And arguably, they shouldn't really have to.

So as Joshua mentioned, we've come up with three different terms, and associated definitions, for how we can define reproducible. Each of these definitions provides a different level of reproducibility. Each has its own unique quality guarantees, but also its own limitations as to the security assurances it can provide. And as we ascend to each of the next levels here, we'll all get a little closer to unlocking the galaxy brain inside of us. On this slide, I'll cover the high-level definitions for each, and then we'll continue to dig deeper into some of the nuance surrounding each definition throughout the rest of this talk.

So even before these levels, and as you saw on the previous slide, the most elementary level of all of this is an unscripted build, meaning some piece of software is built without any sort of automation. Instead, a human is manually running a series of steps to create the final artifact. And even if the same series of steps is being run, we can't assume that this process is repeatable. There's human error that might be introduced. There are different build environments to consider. And without any sort of build script, we're not able to reproduce this final artifact consistently.

But as we move towards reproducibility, we get to this first level, which we're calling repeatable builds. Repeatable builds control the steps for a build. This means that the build is scripted, and therefore I can reliably repeat the build process with some level of guarantee that the steps will be executed in the same order, regardless of whether I'm running the build on my laptop or a CI system.
And regardless of where I'm running it, because I'm using a build script that is inherently repeatable, I expect this build to do relatively the same thing, whether I run it today or two weeks from now.

The next level of reproducibility we've coined rebuildable builds. Rebuildable builds control all explicit inputs for a build. So if your builds are executing at this rebuildable level, you're using some sort of build script, but it also means that your build processes and infrastructure systems are capturing some of the state that is not controlled in a repeatable build process. By controlling things like artifact repositories and caching intermediate build artifacts, you're able to produce an equivalent artifact that can be reproduced at any arbitrary future point in time. I did not use the word identical here, and that's on purpose. A rebuildable build will not guarantee a binary identical artifact, but it will guarantee an equivalent artifact, and we'll cover the nuances between those definitions shortly.

The last rung of this reproducible builds ladder can be thought of as the end goal that we're all striving for. It builds on all the levels before it, and it's what we're calling binary reproducible. You may have also heard terms like hermetic or deterministic builds. Those mean the same thing for the purpose of this talk, but we felt that binary reproducible is the clearest and most descriptive label for it. Binary reproducible builds control all state of the build. This means that regardless of when and where the build is run, when you run a build with the same inputs, which by nature are fully detailed, you're able to achieve bit-for-bit identical outputs.

And when we're talking about these definitions, you might hear us use the phrase "regardless of when a build was run." I think sometimes we hear this and we think, oh yeah, the build should be the same now and two months from now or three months from now.
But I want you to stretch that timeline a little bit and think of it in the scope of years. If I ran a build today and a build a year from now with the same inputs, how different would those outputs be? Or is that even possible with my build process right now?

For each of these levels, there's an inherent overlap. Each subsequent level necessitates the one before it. Rebuildable builds are founded upon the same principles as repeatable builds, and binary reproducible builds incorporate all the requirements for repeatable and rebuildable builds, with additional build process controls. At each level, we see an increased security benefit and more protection against malicious backdoor injections.

Okay, so let's dig into that first definition. A repeatable build, as Rose has already described, is a build we can repeat on the same or a different system and expect the same steps to be executed. Repeatable builds require that we have some kind of scripted build process, which is just, frankly, sane software engineering. We need to be able to run the builds and have them behave the same way, whether it's on our workstation, our CI system, or a colleague's workstation. We might achieve this with a Makefile, or maybe with the common build tool for your ecosystem. But however we do that, we have something that everyone has access to, and that is version controlled, that we can use to run the build.

But a repeatable build is often temporal. If I run a build script today and then run it again in three weeks, I wouldn't necessarily expect the same things to happen. Mostly that's because of what we already touched on: modern software uses a significant amount of layering and a large number of third-party and open source components, which are typically retrieved at build time from package repositories, and these direct and transitive dependencies are usually not lifecycle managed by default.
That introduces opportunity for the build to behave differently in the future. And so managing this for the complex layer cake of cloud software, or really for any modern build of complex software, becomes really difficult. It can be challenging to manage the tree of dependencies.

To expand on this a bit: apart from that dependency problem, the other factor in what makes repeatable builds not a suitable solution, not the end goal that we're looking for, is that the tools involved often capture too much of the local system state by default. A typical build script, and the toolchains that it calls, will often include various aspects of both the system state and the user's configuration when running the build. So you might see things like paths and timestamps embedded in the outputs. How the software behaves might be affected by time zone, obviously, with the embedded timestamps, or by locale, because different locales sort strings differently, for example. These all impact the produced output and prevent it from being binary reproducible.

And the dependency problem is often underspecified in a scripted build. Typically, scripts might capture the top-level dependencies for a project but won't dig in and capture the whole tree of dependencies that they in turn pull into the build. So even if a user is diligent and pins their dependencies, the indirect dependencies typically represent volatile inputs, and any fetches over the network can and will fail in the future, whether through some kind of accident or malicious intent. It's almost impossible to rely on a download from the internet, so that will affect our ability to repeat the build in the future.
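To make the point about captured state concrete, here is a minimal Python sketch, not taken from the talk, showing how an embedded timestamp and file ordering change an archive's digest, and how normalizing both makes the output stable. The function names and the `normalize` switch are hypothetical and exist only for illustration.

```python
import hashlib
import io
import tarfile
from typing import Dict


def build_archive(files: Dict[str, bytes], mtime: int, normalize: bool) -> bytes:
    """Pack files into an uncompressed tar archive held in memory.

    With normalize=True, entries are sorted by name and the modification
    time is fixed at 0, so the bytes depend only on the file contents.
    With normalize=False, insertion order and the given mtime leak in.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        names = sorted(files) if normalize else list(files)
        for name in names:
            data = files[name]
            info = tarfile.TarInfo(name=name)  # uid, gid, and owner names default to fixed values
            info.size = len(data)
            info.mtime = 0 if normalize else mtime
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def archive_digest(files: Dict[str, bytes], mtime: int, normalize: bool) -> str:
    """Hex SHA-256 digest of the archive produced by build_archive."""
    return hashlib.sha256(build_archive(files, mtime, normalize)).hexdigest()
```

Real toolchains have many more sources of variation than this, but the pattern is the same: every piece of ambient state either gets removed from the output or becomes an explicit, recorded input.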
So in summary, repeatable builds give you confidence when you need to be able to reliably perform the build again. Probably the biggest, or perhaps even the only, software supply chain security benefit of them shows up when you need to release a build on short notice. Certainly if you're trying to issue an urgent fix under pressure, you definitely don't want to be following manual build steps. So having a scripted build, having a reliable development and release process, is going to be a major boon when it comes to making a release under pressure. And of course it's really important for your developers' sanity, and sane developers are important for your software supply chain security. But it's not going to provide any benefits for security or compliance monitoring or for reproducing bugs in the future, and it isn't going to prevent a situation where your builds are breaking because third-party components have disappeared.

So as you recall, the next level of reproducible builds is rebuildable builds, which is a level at which we control the explicit inputs for a build. Rebuildable builds are build processes that can be run at two arbitrary points in time and produce equivalent, but not necessarily identical, artifacts. If your build is rebuildable, it implies that any intermediate build artifacts are stored and under your control. This means that you likely have some sort of caching mechanism for intermediate build artifacts, and those artifacts can reliably be retrieved at a future date.
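One way to picture "stored and under your control" is a content-addressed store, where each artifact is filed under its own digest and checked again when it is retrieved. The sketch below is a hypothetical illustration in Python, not a description of any tool mentioned in the talk; the `ArtifactStore` class and its layout on disk are assumptions.

```python
import hashlib
from pathlib import Path


class ArtifactStore:
    """Hypothetical content-addressed store for intermediate build artifacts."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        """Store data under its SHA-256 digest and return that digest."""
        digest = hashlib.sha256(data).hexdigest()
        (self.root / digest).write_bytes(data)
        return digest

    def get(self, digest: str) -> bytes:
        """Return the stored bytes, refusing anything that no longer matches."""
        data = (self.root / digest).read_bytes()
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError(f"stored artifact {digest} does not match its digest")
        return data
```

A build that records the digests it consumed can later ask the store for exactly those inputs, and any corruption or tampering in storage surfaces as an error rather than as a silently different build.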
It means that you're not pulling artifacts from third-party locations; instead, you're mirroring them internally or rebuilding them from source, which is also under your control. Compared to repeatable builds, there's a significant increase in effort to achieve rebuildable build status.

An example of a repeatable build, or a rebuildable build, excuse me, might be a Dockerfile with dependencies that are being pulled from an internal repository, with the pulling from an internal repository really being the key. It doesn't mean that you're only specifying a version of a dependency you're installing, and it certainly does not mean that you're using a generic tag to specify your base image. Rather, it implies that anything being installed in your container comes from a controlled repository. This can be really difficult when you take into account that you have base images, you have direct dependencies, and you have transitive dependencies. Those can add up very quickly.

And since I did use a Dockerfile example, I'd also offer a word of caution regarding the layer caching mechanism that Docker uses to build container images. As you might be aware, each Dockerfile command corresponds to a different layer in the resulting container image that gets built. As it builds, Docker locally caches each of these layers on disk so that it's faster the next time you build, because Docker will use the built layer served from your local cache. Each time the build is run on that same machine, Docker will use the local cache for that layer unless the Dockerfile RUN command changes or the files being copied in have changed, in which case subsequent layers will be rebuilt. This caching can give a false sense of security to developers, right? Developers think: great, as long as my Dockerfile is recorded, my build is cached, therefore intermediate artifacts are stored and I have a reproducible build. However, if you are using fresh environments for your builds, or the
builds are happening on temporary CI systems, that cache is not going to be available, which means that your build is not actually reproducible.

Now you might be thinking: okay, I'm following a build script, I control the intermediate build artifacts, so what would make the final artifact not binary identical at different points in time? This is something that can happen when dates, times, build paths, those types of things, are embedded in the built objects, either by the build system or otherwise. If this happens, the final artifact is likely going to differ at different points in time, especially if it's built on different build systems. In a perfect world, these variables, like time or date stamps and build paths, would either not exist or be declared as explicit build inputs. When date stamps or build paths are included as part of your build, it becomes impossible to reproduce bit for bit, even though arguably, in most cases, slight variations like timestamps won't actually change the fundamental behavior of the software. But if they exist, they do make it more difficult to detect malicious anomalies, and impossible to automate any checks for reproducibility using SHA sums when you're comparing final artifacts.

So in practice, rebuildable builds offer more assurance than not, and certainly more so than repeatable builds. Controlling the explicit inputs for a build offers protection against a scenario where build artifacts become unavailable at a future date. As Joshua touched on before, this might happen maliciously, say when a developer suddenly yanks all their published modules or intentionally publishes malicious code in protest, or it may happen harmlessly, like when distributions garbage collect old versions of packages as they publish newer versions. Rebuildable builds improve the reliability of the entire development and release process, and they ensure business continuity when delivering software and support to customers. And I know in this SaaS and
containers world that a lot of us operate in, it's hard to imagine that there's software out there that isn't running on latest, but as someone who has worked at IBM, let me tell you that it exists. There's embedded software running on legacy hardware that is years old. If you're a software vendor without any way to rebuild old versions, or at the very least access some copy of an older version, there's no way an engineer is going to be able to reproduce and debug a customer issue in the field. "Just upgrade your software" is good advice, but it's unfortunately not always a feasible solution. If you're providing support to customers in a situation like this, rebuildable builds are a prerequisite for remediating customer issues.

And like anything in life, there are trade-offs, and this is especially true when achieving rebuildable build status. That trade-off primarily presents itself in the form of infrastructure costs. If you're mirroring dependencies and storing intermediate build artifacts, you need some place to put them, and at any sort of scale that storage is not going to be free. While I personally think the cost pays for itself in benefits, it is still a cost, right? And if you're not a company with Google-sized resources, this might be a factor in how you move towards implementing rebuildable builds.

Okay, so now we get to the third-level definition, and this is the only one where we actually include the word reproducible in the name. Binary reproducible builds, as Rose already said, you might hear referred to as deterministic builds or hermetic builds, and effectively the definition is quite simple: when I rerun a build with the same inputs, I get bit-for-bit identical outputs, regardless of who, where, and when. So the build environment must be fully defined in order to achieve this. If we want to provide the same inputs to the build, we have to fully define what all of those inputs are. And a reasonable modern example of
achieving a reproducible build is with the Go programming language. You can vendor your dependencies, which fetches all of the dependencies from the network and includes a copy of them in a folder in your project tree, and then you can perform the build with the vendored dependencies. You can use the -trimpath flag to remove the path prefixes, so that only the relative path gets included in the objects, and you can pass some linker flags to remove some other things from the built objects. So if you ran this go build command on a different system at a different point in time in the future, as long as you had the vendored source in your project tree, you would get the same output. And when I say same, I of course mean bit-for-bit identical output.

So a binary reproducible build is inherently reproducible all the time (starry happy face). But of course, as implied throughout this talk, it can be a lot of work (frowning face). The amount of work required depends a lot on the language and toolchain in use, and it tends to scale with the complexity of the system. We just demonstrated that it's fairly easy to produce a reproducible Go binary, and the same is true for some other ecosystems, like Python wheels; nowadays it's fairly easy to create a binary reproducible Python wheel. But it gets more complex if you include the transitive dependencies of the Python wheel, or if your code base has multiple implementation languages. If you're using Java and C++, you're probably going to have a lot of work ahead of you. Maybe you're running on containers; as Rose already indicated, it can be quite a lot of work to achieve reproducibility for all of the layers involved. If you've got a combination of all of the above, you've probably got significantly more work involved in achieving binary reproducible builds. So solving this problem for all of the components you ingest into your project, third and first party, can be massive.

The supply chain implications of binary reproducible
builds are quite significant. A major advantage is that you don't have to implicitly trust the machine you're doing your builds on. Most of us as software engineers don't control the machines we're doing our builds on. We're using cloud services, CI infrastructure as a service, whatever it is; they're just black boxes we're running our builds on. So being able to verify reproducible builds elsewhere is great. It's really good for the engineering effectiveness of your entire project. I would argue it can help improve the reliability of the entire development and release process, because your chance of flaky builds is effectively diminished: if the build is binary reproducible, you would expect to be able to reproduce the failure as well as the success. It provides caching benefits and makes delivery optimization easier. But as I discussed, it can be very difficult. It's not without its cost, which is why, despite the fact that many smart people have been working towards this goal for a long time, it hasn't been achieved by default for many projects.

So now we know what reproducible actually means: when I run my build at two arbitrary points in time, regardless of the build machine, I get the same bit-for-bit identical output. I don't need to inherently trust my build service, which gives me choice, flexibility, and peace of mind. And maybe you've listened to the talk thus far and thought: okay, I understand the progression, I understand the differences between these levels, but why should I care about this? Of course binary reproducible is the gold standard, but as you guys have said, it sounds hard. It is hard. Does it really add many more security assurances for the effort that it takes to achieve?

So should we really care about binary reproducible builds? I would argue that, as open source developers especially, we should care. I think it's important for all developers, but especially for open source
developers. Most of our users would look at us funny if we threw some code over the wall, or threw a build over the wall, and yet are the packages that we're all downloading from open source repositories really that different? Our users choose, often implicitly, to trust the developers and their source code. By its nature, source code is verifiable, and with distributed revision control systems it's easier to detect tampering. This has happened in the wild with Git, and commit signing also helps provide some additional guarantees here too. But between accessible source code and conveniently usable binaries there are the build systems, the platforms on which the binaries are produced, and the content repositories we store those binaries in, and these are all places where a backdoor or unauthorized tampering can be introduced. Verifiably reproducible builds provide a way to prove a binary matches the claimed source. Even if the user does build from source on their own systems, a verifiably reproducible build provides them a convenient mechanism to reduce the need to trust that their own systems have not been tampered with.

So given the advocacy for binary reproducible builds, what can we do to move towards making this a reality? Before we devise a plan for how we can implement this, how we can move towards it in the future, I think we also need to understand what's preventing us from achieving binary reproducible builds right now. The first is a lack of common understanding, and maybe even verbiage, for what level of reproducibility your software is actually achieving. I've spoken to well-intentioned folks who think their build is reproducible because they have a Dockerfile, which in their eyes is a record of how to rebuild their container: it's reproducible. And if you take a very loose definition of reproducible, this is technically true. You can reproduce the steps to generate the container. But as we've covered, there's a lot that is not reproducible about this, and
it's especially not binary reproducible, which others might colloquially understand to be the default definition when somebody says "I have a reproducible build."

There's also a purely practical aspect to this, in that effort and resources are required to achieve binary reproducibility, and this gets harder to do the further up the stack you go and as the scope and scale of your software increase. A project with no direct dependencies is going to be much easier to binary reproduce than a project where you have to mirror and manage a thousand dependencies, right?

Reproducible builds are also hard to achieve today because tools can sometimes work against us. I'm not picking on Docker here, but I think Docker is a good example of why this can sometimes be hard. Docker builds containers in a layered format using a local cache, and most Dockerfiles are written to pull from latest and to pull from random corners of the internet. I think this demonstrates how what can seem like reasonable decisions, or reasonable trade-offs, can make achieving binary reproducible builds harder. If developers can't rely on the default functionality of the tool, they can't get build reproducibility by default. Instead, they have to consciously make and implement other trust decisions that require more work to achieve.

So what can we do, then, as individual developers and as teams working on software projects, to build towards reproducible builds? I think the first thing is understanding the factors that affect build reproducibility and avoiding those pitfalls. I hope that our talk, and Benard's talk earlier, provide some good impetus and some good information, and the Reproducible Builds website is a really valuable resource for understanding some of the specifics there. I also think it's important to scope appropriately. Things you directly control, your projects, the container images you produce, etc., are
much easier to affect than trying to solve this for the entire ecosystem in a single swoop. Incremental progress is going to help you feel like you're achieving things. Especially for complex software projects, it takes a very long time to go from not at all reproducible to fully reproducible. We see this, for example, with the Microsoft Windows operating system, which has been building towards reproducible builds across several major OS releases. I'm not sure if they've achieved it fully yet, but it's certainly taken a number of years for that operating system to achieve the gains they have.

I also think it's important to adopt the mechanisms that exist for your ecosystem. Python and Go are well served, but there are also efforts happening in other ecosystems. They might not be happening in the default tools, but it's much easier to change one of the tools you're using to produce your software than it is to rewrite all of your software. And finally, if your tools don't support reproducible builds and you're in a position where you can choose new tools, I would recommend that you select tools that do support reproducible builds. Otherwise, I think it's useful to file a feature request and let the people producing the tools know that this is a priority for their users. I think it's really interesting that the original motivation for the Yarn package manager for JavaScript was actually being able to manage dependencies consistently across machines, being able to reproducibly deploy JavaScript packages. So the more tool developers understand that we care about these things, the more they will prioritize the features we care about.

And there's also work we can do as a community. Open Source Summit is all about open source communities, I think. One of the ways we can come together as a community and improve build reproducibility is investing in both the ecosystem and
the tooling. And by investing, I mean both monetary investment and also time investment, actually collaborating with those ecosystems. The Reproducible Builds project, which we've talked about a few times, is a working group and a collaboration space that develops standards, produces tools, and defines specifications to enable projects to achieve build reproducibility. The Debian and SUSE projects have been deeply involved in reproducible builds for a long time, and both of their Linux distributions are trending towards being significantly reproducible. There's a great project called rebuilderd, which can monitor a distribution's package repository and attempt to verify whether the packages published by the distribution are reproducible. So the more rebuilders that are deployed to verify these reproducible builds, the better for the entire ecosystem. Not everyone has to rebuild from source to verify that the built artifact matches the source. We can gain confidence by helping each other out, pooling our resources, rebuilding across multiple systems, and developing some kind of confidence metric based on the number of systems which have been able to reliably reproduce the build.
I also think the SLSA project is worth a look for anyone that's interested in software supply chain security and build system security. The SLSA project provides guidelines for protecting against tampering with projects and build systems, and recommends reproducible builds as a way to achieve that at higher levels.
Reproducible builds are the first ripple in the water as we move towards engineering excellence and a more secure software supply chain. But why should we stop there? Now that we understand the principles and motivations behind reproducible builds, why not apply those everywhere? So yes, reproducible builds, but maybe also reproducible build servers and reproducible infrastructure.
Let's aim for full traceability and confidence in every part of the development cycle. If we verifiably derive as much of our pipeline and infrastructure as possible from source code, we can require multi-party approval of changes and redeploy with ease. This idea of reproducibility also ensures that we don't need to keep anything for extended periods of time, so we can just recreate it when we need it instead of storing it. And this idea of short-lived everything in the development pipeline means more secure software all around. It means that even if bad actors are able to gain access to a build system one time, they don't retain that access, and by regularly redeploying ephemeral environments and hosts, we can protect against rootkits and other persistent threats.
Applying consistent and reliable reproducibility principles to our software development pipelines and infrastructure is not just good for you and your software, it's good for the entire ecosystem. It provides transparency as to how things are done, which makes it harder for bad actors to hide, and it provides an audit trail, which makes it easy to spot malicious actors when they attempt to intervene.
We're confident that the industry is headed in the right direction as we see an increased focus on reproducible builds today and an increased focus on development best practices. We certainly hope that this talk was informative and maybe you even learned something. So thank you so much for your time. We're here to answer any questions and continue further discussion. Thank you.
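The idea mentioned in the talk of pooling independent rebuilds and deriving a confidence metric from how many systems reproduce an artifact might be sketched roughly as below. This is only an illustration under stated assumptions: the function name `reproduction_confidence`, the rebuilder names, and the "fraction of matching rebuilds" metric are all hypothetical, and none of this reflects how rebuilderd or any real rebuilder network actually works.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of an artifact as a hex string."""
    return hashlib.sha256(data).hexdigest()


def reproduction_confidence(published: bytes, rebuilds: dict[str, bytes]) -> float:
    """Fraction of independent rebuilds that match the published artifact bit for bit.

    `rebuilds` maps a (hypothetical) rebuilder name to the artifact it produced.
    The metric itself is an assumption made for illustration.
    """
    if not rebuilds:
        return 0.0  # nobody has rebuilt it, so there is no evidence either way
    expected = sha256_hex(published)
    matches = sum(1 for artifact in rebuilds.values() if sha256_hex(artifact) == expected)
    return matches / len(rebuilds)


if __name__ == "__main__":
    published = b"example package contents"
    rebuilds = {
        "rebuilder-a": b"example package contents",
        "rebuilder-b": b"example package contents",
        # A build that embedded something environment-specific, such as a timestamp.
        "rebuilder-c": b"example package contents, built 2024-01-01",
    }
    print(f"confidence: {reproduction_confidence(published, rebuilds):.2f}")
```

Comparing digests rather than raw bytes mirrors how artifacts are usually compared across machines: each rebuilder only needs to publish a hash, not the artifact itself.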