Hello everyone. Today we're going to talk about demystifying reproducible builds. A little bit about me: my name is Rahul Bajaj. Currently I'm working as a site reliability engineer at Red Hat. I recently graduated from Queen's University, and reproducible builds were part of my thesis.

So, the agenda. I have divided this talk into two parts. The first part covers the basics: what a software supply chain is, what a software supply chain attack is, and which conventional security practices are not sufficient against such attacks. After that, we introduce what reproducible builds are. That is part one of the agenda. Let's look at these one after the other.

So what is a software supply chain? Most open source software today is part of a larger ecosystem, like a Linux distribution. These Linux distributions are made up of thousands of interdependent packages that form a supply chain. Because the packages are interdependent, this supply chain is not easy to maintain, and a lot of the attacks we have seen in the past have occurred in the software supply chain. Some of the famous ones we know of are SolarWinds, Mimecast, and Log4j.

So what do we know about a software supply chain attack? In this talk, we focus on supply chain attacks in the build phase. Say, for example, you have source code and your end goal is to generate a build artifact. In this process, an attacker injects malicious code during the build phase. That is how a software supply chain attack occurs in this case.

Now, some of the conventional security measures are not really sufficient. Say, for example, you would say that the general industry standard is to sign your packages or sign your code.
And that is supposed to protect you from supply chain attacks. Now, using signed versions is one way to get some security for the supply chain, but it's not enough on its own. Say, for example, your developer has written the code, it is in the build phase, and an attacker injects malicious code during the build. Even after that, the artifact will be signed. So even signed code can be affected by a supply chain attack.

The second thing we hear a lot is people saying: update your software to the latest version. But that's not enough, right? Because the latest version could itself carry the supply chain attack. The latest version could contain the malicious code.

The third one: monitor software behavior. Yes, this is a way to identify a supply chain attack. But the problem is that the attack has already happened, and it may have been going on for a while before we notice it from the behavior. So this is also not an efficient way.

And the fourth is: review your source code. Developers do review the source code, but there is not much they can do about the build phase. They may not have much insight into the build phase, and that's why these attacks are possible.

So these conventional security measures do help mitigate supply chain attacks, but not completely. And therefore the industry is moving towards something called reproducible builds.

What are reproducible builds, in simple terms? When I run my build again, I must get the same build artifact. That's it in layman's terms. Now, there is a diagram that explains the concept of reproducible builds very well. I have taken it from a research paper. On the very left side, you see a yellow box. On the very right side, you see the end goal, which is your software artifact, your build artifact. In between, you have the build phase.
So consider two paths. The first is the upper one, where you have a software vendor, say, for example, your Linux distribution. In that path, you take your source code, you perform the build process using the build dependencies and the build toolchain, and you create a build artifact. When you create this build artifact, you also create a hash for it. The second path, the lower one, is an independent build. You take the source code, you perform the build, and you generate a build artifact. Now, if the hashes of both artifacts, the one produced by the software vendor and the one from the independent build, match, then we say that the build was reproducible. And if they don't match, we say the build was unreproducible.

The inspiration for my research was a blog post by David Wheeler. He said that there are a few packages that are more crucial and that need to be reproducible before the other packages. So in my study, I tried to find out which packages developers must focus on and make reproducible before others.

So we come to the second part of our agenda, which is the study I conducted. It's called "Unreproducible builds: time to fix, causes, and correlation with external ecosystem factors." It has been submitted to the Journal of Empirical Software Engineering, and it's still in review. But for today, we will give a glimpse of this study.

There are three things that we need to understand, I believe, about unreproducible builds. The first is how long it takes to fix an unreproducible build. Initially, a package might be unreproducible. How much time does it take for that build to become reproducible for the first time? That is the first thing to understand.
And the second thing to understand here is: once that package becomes reproducible, what changes are made to it that turn it unreproducible again? So we're trying to estimate the time and effort developers need for a build to become reproducible for the first time, and then the changes after which it might become unreproducible again. That's the first thing.

The second thing is that we perform a qualitative analysis of all the reproducibility issues recorded on the Debian website, and we group those issues into a few categories, which become the root causes.

And the third one is that, as we mentioned in the beginning, a package belongs to an interdependent ecosystem. So what are the external factors that might affect the reproducibility of a package? What we want to understand next is that a package is part of an ecosystem, and that the developers and maintainers of that package are not the only ones responsible for its reproducibility.

Just to give an idea, we performed these experiments on data from 18 million builds of the Debian distribution, up to 2021. If you look at the paper, it also compares Arch Linux and Debian. But for this presentation, we consider only the builds from the Debian distribution.

So let's look at the first research question: how long does it take for a particular package to become reproducible, and vice versa? To answer this, we use the fact that Debian groups its packages into package domains. What are these package domains? They are groups of packages that Debian categorizes. So, for example, the admin package domain has packages related to administrative tasks, like adduser, Ansible, and cron.
There are other package domains, like net, which has the netstat package, and you get the idea: games would have 3D chess and Mario and things like that. We split these package domains into two categories. One is the crucial kind of packages and the other is the trivial kind of packages.

Now look at the graph that we obtained from survival analysis. This survival analysis shows how long a package remains unreproducible. On the top, there is an orange curve, which represents the trivial packages, and below it there is a blue curve, which represents the crucial packages. The dotted lines indicate the one-year mark, and at that mark you will see that there is a huge difference: trivial packages tend to remain unreproducible for a longer time. Now, this is good news, right? It means developers are giving crucial packages more priority than trivial packages as of today.

However, once they are fixed, we found that trivial packages remain reproducible for a longer time than crucial packages. One reason we suspect is that crucial packages undergo a lot of changes, and because of these changes they become unreproducible again and more effort is needed.

So to conclude what we have seen so far: making packages reproducible is not easy, and currently developers are prioritizing the right packages to become reproducible.

We move on to the second research question: identify the issues that lead to unreproducible builds. For this, we performed a qualitative analysis. In the previous literature, only six root causes had been mentioned, but in our study we found 16 root causes, divided into four major categories.
The first one is build, the second is file system, the third is memory, and the fourth is system. I will go through only a few of these root causes, because explaining each one is not possible here. You can read the paper for the rest. To give an idea of the most important ones, let's start with the first one from the build category, which is build path.

Build path: when two builds are performed on distinct machines, the artifacts can come out different, that is, unreproducible, because of the build path. Why? Because in one build a relative path was used, and in the other build an absolute path was given. So this is an example of how builds become unreproducible because of build path.

The second example is build timestamp. The build timestamp is the time at which the build was actually performed. For two distinct builds, the time will not be the same, and so the build would be unreproducible. For this, the reproducible builds community came up with a variable called SOURCE_DATE_EPOCH, which in Debian takes the date of the latest entry in the changelog, and that should make the builds reproducible.

The third one we will talk about is file system ordering. The order in which files are listed may differ between two distinct builds and cause unreproducibility.

Then there is randomness. For example, in Python 2.7, when you use data structures like dictionaries and sets and you output data from them, they might output the data in a different order, and that can cause unreproducibility. You could use functions like sort to handle those situations.

Then there is encoding. If a particular snippet is encoded on different build systems with different encoding mechanisms, then the builds would be unreproducible. So you get the idea.
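To make a few of these root causes concrete, here is a minimal Python sketch. It is an illustration only, not Debian's tooling and not the method used in the study. A tar archive stands in for a build artifact, and the helper names `build_archive`, `make_tree`, and `demo` are hypothetical. The sketch sorts the file list (file system ordering), clamps modification times to `SOURCE_DATE_EPOCH` (build timestamp), stores only paths relative to the source directory (build path), and then compares SHA-256 hashes of two builds made in different directories, in the spirit of the vendor build versus independent build comparison described earlier.

```python
import hashlib
import io
import os
import tarfile
import tempfile


def build_archive(src_dir: str, source_date_epoch: int) -> bytes:
    """Pack src_dir into a tar archive whose bytes depend only on file contents."""
    # Collect paths relative to src_dir so the absolute build path never
    # ends up inside the artifact.
    rel_paths = []
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            rel_paths.append(os.path.relpath(os.path.join(root, name), src_dir))

    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        # Sort explicitly: directory listing order is not guaranteed.
        for rel in sorted(rel_paths):
            full = os.path.join(src_dir, rel)
            info = tar.gettarinfo(full, arcname=rel)
            # Replace per-machine metadata with fixed values.
            info.mtime = source_date_epoch
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            info.mode = 0o644
            with open(full, "rb") as fh:
                tar.addfile(info, fh)
    return buf.getvalue()


def make_tree(root: str, names: list) -> None:
    """Create small files in the given order (hypothetical source tree)."""
    for name in names:
        with open(os.path.join(root, name), "w", encoding="utf-8") as fh:
            fh.write(f"contents of {name}\n")


def demo() -> tuple:
    """Build the same sources in two different directories and hash both."""
    # Fall back to an arbitrary fixed value if SOURCE_DATE_EPOCH is unset.
    epoch = int(os.environ.get("SOURCE_DATE_EPOCH", "1609459200"))
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        make_tree(a, ["b.txt", "a.txt"])  # different creation order
        make_tree(b, ["a.txt", "b.txt"])
        os.utime(os.path.join(b, "a.txt"), (1, 1))  # different mtime on disk
        first = hashlib.sha256(build_archive(a, epoch)).hexdigest()
        second = hashlib.sha256(build_archive(b, epoch)).hexdigest()
    return first, second


if __name__ == "__main__":
    vendor_hash, rebuild_hash = demo()
    print("reproducible" if vendor_hash == rebuild_hash else "unreproducible")
```

Removing the `sorted` call or the `info.mtime` assignment is an easy way to see the file system ordering and build timestamp issues in isolation. Real packages embed paths and timestamps in many more places than archive headers, so this covers only the simplest cases.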
So there are two build systems. The way you test builds in Debian is that you build your package on those two build systems, and if you hit differences like these, your build is unreproducible.

One of the interesting findings from our research is this: previous literature claims that build timestamp is one of the major causes of unreproducibility, whereas we found that although build timestamps account for a higher frequency of reported issues, build path and randomness affect more packages than build timestamp does.

Another interesting fact we found is that packages might be affected by multiple root causes, and this makes it more difficult for developers to make the builds reproducible. There are two challenges here. The first is to identify which root causes are making a build unreproducible, and the second is to actually fix them. Identifying and then fixing several root causes is much more difficult than dealing with a single one.

Our last research question addresses the ecosystem that the package belongs to, and whether any ecosystem factor influences a package to become unreproducible. Our first external factor is the build dependency. We found that the reproducibility of a package might depend on the reproducibility of its build dependencies. To support this claim, we found the top ten most influential build dependencies, meaning the build dependencies that are used by the most packages. debhelper and pkg-config were the two most used build dependencies. For debhelper, if you look at the graph, you will notice that its builds have been reproducible for the past six years. As expected, the packages for which it is a build dependency have been reproducible 85% of the time.
Whereas in the case of pkg-config, you see that its builds have been unreproducible for the past three years, and correspondingly we see a considerable 25% of the builds of packages that use pkg-config as a build dependency turn out to be unreproducible. So this supports our point.

One interesting fact: you remember package domains, right? They are the categories of packages that Debian defines. One of the things we noticed concerns the libdevel package domain. Whenever the reproducibility status of packages in the libdevel domain changes, the reproducibility of the packages that use them as build dependencies is affected, either positively or negatively. Now what are these libdevel packages? They are mostly GCC packages. Most of the GCC packages fall into this libdevel domain.

The point I want to make here is that GCC is a different project altogether. They have their own way of working and their own process for submitting a change and getting it merged. While Debian is pursuing this reproducibility initiative, it must keep that in mind. In one of the cases we found, making packages reproducible required changing a flag in GCC, and if the change was not made in GCC, they would have had to change 3,000 or more packages in Debian, or their dependencies. So what I want to say is that reproducibility depends on other packages as well, and if GCC maintainers take time to merge this kind of change, then the packages that depend on the libdevel packages might remain unreproducible for that period of time.

Our next external factor is the runtime dependency. This is the runtime dependency of the build dependency that we discussed earlier.
So the runtime dependencies of the build dependencies are also one of the factors on which the reproducibility of packages depends. We found that 51% of the runtime dependencies are also build dependencies. So in conclusion, what I want to say is that packages that have the dual responsibility of being runtime dependencies as well as build dependencies must be prioritized for becoming reproducible.

So the key takeaways from this presentation: crucial packages become reproducible faster than trivial packages. This is a good thing for us. Second, build path is the most influential root cause, while build timestamp has a greater frequency of issues reported. Packages affected by multiple root causes are more difficult to make reproducible. And external factors from the ecosystem in which packages are built are vital for a package to be reproducible. So yeah, that is all I have. If you have any questions?

Yeah, I wanted to mention something. You say that you found build paths to be the most influential, the biggest cause of unreproducibility?

So what we found is that when we took the data and looked at how many packages were affected by which root cause, more packages were affected by build path than by build timestamp or any other.

Yeah, right. What I wanted to say is that the reason for this is that build timestamps were a huge issue in the past, but thanks to SOURCE_DATE_EPOCH they became quite easy to hide. Timestamps used to be included in all kinds of archives, for example, and all kinds of files, and then we adopted SOURCE_DATE_EPOCH and that just cancelled out all of that. After that, build paths are just so much harder to get rid of.

Right. I see. Okay.

Mostly that it used to be bigger and now it's not anymore.

No, it's not anymore. Correct. Because we retrieved the data up to 2021.
So I think by then it was already kind of addressed.

The timestamps were huge. If you look at the graphs of how reproducibility improved, it just ticked up suddenly when we developed...

Yeah, I think that was around 2014 or something like that, where there was a huge spike.

Yeah, I think '15 probably.

'15, yeah, exactly. Very early. Thank you.

Thank you. I'm sorry if I missed this, but I was wondering: in your study, how did you measure whether a package was reproducible or not, in terms of building it?

So we did not build those packages ourselves. This is a data set that we got from the Reproducible Builds project itself. Debian stores the results for all their packages in a database, and we pulled that database to do the study.

Awesome.

Do you know why debhelper is so effective for builds in this respect?

Yeah, I think because for building all the packages it's mostly used as a build dependency.

I can add to that. Some time ago, Debian adopted something like pbuilder using chroot, similar to what is done, for example, in build systems like Bazel. So it's a very restricted environment, with control even over cgroups and everything that can affect the build. So that's why the results are very good right now.

Yeah.

And pkg-config can't do that yet.

Awesome. Thanks. Any more questions? Thank you.