Welcome to my talk, Software Transparency: Package Security Beyond Signatures and Reproducible Builds. In this presentation, we are going to talk about the security of software distribution. So how can we make sure that the software that lands on the machines of our users is actually the software they intend to run? For this, we propose to add a new security feature on top of existing systems like secure APT, and reproducible builds will also play a big role. Of course, introducing new features doesn't work without changing the existing infrastructure a little bit, so we are also going to take a look at how we can approach that. And then I'm, of course, very interested in feedback, your ideas, what you think might be a hurdle, things like that. So let's start off with some examples where there were problems in the security of software distribution in the area of free software. There were problems with Chrome extensions where developers were phished: their credentials were stolen and malicious extensions were uploaded. There were malicious images on Docker Hub. There were compromised NPM packages, the Gentoo GitHub compromise quite recently, and also quite recently the problem with the Arch User Repository, where malicious packages were uploaded. The mature social and organizational structures in Debian are quite a strength in this regard, but nevertheless, we should prepare for things going wrong, I think. Let's take a look at the way existing software distribution works in Debian. We have a person who uploads a package. All the packages together are then distributed by the archive. They are signed by the archive with its signing keys. And then they are shipped off to the mirror network, our content distribution network, where all the packages and the metadata, meaning dependencies, which versions are available, and so on, are distributed to the clients, which can download them over HTTP or HTTPS.
This architecture is quite nice because it allows the part that has to scale well, in particular the content distribution network, to be untrusted. So it doesn't matter who runs a package mirror for us; they can't inject packages without users noticing. That's achieved by the archive signatures. The architecture does, however, have drawbacks. The archive signing key is all-powerful, so if anything should happen to this signing key, that would be big trouble. And of course, we also trust our uploaders to only upload appropriate software. So how can we improve the situation? We are going to approach this with two goals in mind. The first is to reduce, or relax, the trust in the archive. What we would like to achieve is that it's impossible, even if the archive is compromised, to deliver malicious packages to specific users. We can't avoid malicious packages being delivered to all users, but we can and want to avoid malicious packages being delivered only to specific users. The second goal is to improve auditability: we want a secondary audit structure in place where lots of important properties can be verified independently of the existing infrastructure. For example, we would like to make sure that there is inspectable source code for every binary, that we have a verified mapping between source and binary, and that we can reliably identify the maintainer responsible for the distribution of a package. And in case something goes wrong, we would also like a strong indication of where the problem originated. This proposal works on these specific goals. The approach we are going to take in order to achieve this is to make sure that everybody runs the exact same software. This is the main idea we are going to follow.
And we can immediately see that if we achieve this, then targeted backdoors, malicious packages delivered only to specific users, are no longer possible. This is a huge advantage, because then any malicious package has to be delivered to everybody, which greatly increases the risk of detection and therefore makes the attack less attractive to the attacker. If everybody runs the exact same software, I can go ahead and audit or analyze a particular package, and the result of this analysis is immediately applicable to all installations of this package, because we can be sure that we actually all run the same software. No targeted backdoors are possible anymore, and we can also pinpoint where things go wrong if they do. Now let's take a look at the design we propose to achieve that. The main idea is that we add an additional component, called a log server, that serves to make sure that everybody runs the same software, and which will also facilitate the auditing goals. It's important that we don't add this component as another trusted third party: we really want anything we add to be built in such a way that we don't have to trust it, but rather can verify it from the outside. So, for example, all the clients can verify certain properties of the behavior of this log server. We are going to keep all the existing infrastructure in place, with the archive and the mirror network delivering software packages to the apt clients. But now the archive also publishes its metadata, meaning dependencies, version numbers and so on, and the source packages, to the log server. And the log server will also have to be contacted by the clients; we'll look at this in more detail later. For now, let's take a look at how the log server has to work in order to facilitate the goals we want to achieve. The log server operates a data structure called a hash tree over a list of elements.
I'm just going to show the basic data structure and one example of what we can do with it, not go through all the properties. So let's start with the list of elements, which are the squares at the bottom of the graph. These list elements are things like source packages and package metadata, so the Packages.gz file, the InRelease file, these kinds of things containing the dependencies and hash sums of the packages. Each of these is one little square here. Over this list of squares, we compute a hash tree, where the parent of each node, going upwards in the graph, is the hash of its children. If we do this, we get an interesting property: the tree root, the topmost node in the graph, reliably identifies all the list elements. We can't change anything in the list without the top node also changing, due to the chained hash construction in the tree. This tree is operated by the log server. Now let's take a quick look at one thing we can achieve using this data structure. Suppose we know the tree root and would like assurance that the square marked with X is actually covered by the tree. Remember, the tree can be very, very large, and the client might have, for example, this InRelease file and want to make sure it is covered by this tree root. What we can do is ask the log server: hey, please send me a proof that the third element is actually contained within this tree root. The log server responds with the nodes marked in gray here, and with these gray-marked nodes we can recompute the tree root and convince ourselves that it covers the element in question. The log supports two operations efficiently and in a cryptographically secure way: proving that a given element is included in the list covered by the tree root, and proving that the list was only ever operated in an append-only manner.
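The inclusion proof just described can be sketched in a few lines of Python. This is a minimal illustration of an RFC 6962-style hash tree, not the prototype's actual implementation; the entry contents and proof encoding here are made up for the example.

```python
import hashlib

def leaf_hash(entry: bytes) -> bytes:
    # Prefix bytes domain-separate leaves from interior nodes (RFC 6962 style).
    return hashlib.sha256(b"\x00" + entry).digest()

def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()

def _split(n: int) -> int:
    # Largest power of two strictly smaller than n.
    k = 1
    while k * 2 < n:
        k *= 2
    return k

def tree_root(leaves):
    """Merkle root over a non-empty list of leaf hashes."""
    if len(leaves) == 1:
        return leaves[0]
    k = _split(len(leaves))
    return node_hash(tree_root(leaves[:k]), tree_root(leaves[k:]))

def inclusion_proof(leaves, index):
    """Sibling subtree hashes (ordered from the leaf level upwards)."""
    if len(leaves) == 1:
        return []
    k = _split(len(leaves))
    if index < k:
        return inclusion_proof(leaves[:k], index) + [tree_root(leaves[k:])]
    return inclusion_proof(leaves[k:], index - k) + [tree_root(leaves[:k])]

def verify_inclusion(leaf, index, size, proof, root):
    """Recompute the root from one leaf hash plus the proof nodes."""
    def rebuild(i, n, prf):
        if n == 1:
            return leaf
        k = _split(n)
        sibling = prf[-1]  # the last proof node sits nearest the root
        if i < k:
            return node_hash(rebuild(i, k, prf[:-1]), sibling)
        return node_hash(sibling, rebuild(i - k, n - k, prf[:-1]))
    return rebuild(index, size, proof) == root
```

Note that the proof size grows only logarithmically with the list, which is why the client can check coverage of its InRelease file without downloading the whole tree.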
So there are no changes to the list other than adding new elements to it. And if we only rely on these properties, there's no need to trust the log server; rather, we can verify all the important properties of the log server from the outside. This will be the main method we use to achieve our goal of ensuring that everybody runs the exact same software. Now let's take a look at the whole architecture in a bit more detail. You'll recall we have the archive, the content distribution network (the mirror network), and our apt clients, which install software from the CDN. We have added a log server operating the tree data structure. The archive submits the package metadata, dependencies, hashes, package versions and so on, into this log server, and it also submits the source packages. The apt client is augmented with an auditor component, which serves to verify the log operation. For example, it can verify that a new tree root is an append-only extension of an older tree root, and it can query the log for a proof that a particular metadata file, for example an InRelease file, is covered by this tree root. There's also an additional component we need, called a monitor. Of these monitors there are a few, and some people should be interested in running them, because they fulfill important analysis functions for different groups of people. In general, a monitor verifies the log operation similarly to the auditor, but it also does additional checks, and it may need to keep a copy of the tree data structure, or of all the elements used to construct it, or at least receive all new elements. The monitor underpins many of the security properties: if you're unsure how we can achieve a certain property or defend against a certain attack, the answer will usually be that there's a monitor function that serves to achieve it.
And keep in mind that it's enough if one party detects a problem, because we then have a strong indication of where something went wrong. If one monitor detects misbehavior, that monitor should have all the necessary data to prove to the outside world, to the wider community, that yes, there was a problem, and the problem was this and that. OK, let's quickly recap. The log server can efficiently and cryptographically prove that a given element is included in the list covered by the tree data structure, and it can efficiently prove that the list was operated in an append-only fashion. The other new components are the auditor, which resides with the apt client, and the monitor, which is a separate additional component. Both verify the log inclusion of important metadata files and the consistent operation of the log server, and the monitor additionally has investigating functions that analyze the data present in the log. Now, so much for the architecture; what changes to the existing system do we need to assume in order to achieve all of this? Starting off with the archive, these are our assumptions. The archive can submit files into the log server, and on submission the log server returns an inclusion promise. These inclusion promises, which are essentially signatures over small items, need to be distributed by the archive in order to hold the log server accountable to its promises. We assume that the release frequency is rather consistent. The archive in our architecture is responsible for distributing reproducible builds; of course, it's possible to have a blacklist if you know that a particular package won't build reproducibly, for example. We also assume that the buildinfo file that's required to build reproducibly is covered by the release file; at the moment, we treat it as source package metadata. We also assume source-only uploads.
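As a rough sketch of the monitor side, assuming the monitor keeps a full copy of the list of elements (a real deployment would more likely use compact consistency proofs so that auditors don't need the whole list): the monitor appends the newly published entries to its own copy and checks that the recomputed root matches the root the log signed. Any rewrite or omission of history makes the roots disagree. The class and entry values below are invented for the example.

```python
import hashlib

def leaf_hash(entry: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + entry).digest()

def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()

def tree_root(leaves):
    """Merkle root over a non-empty list of leaf hashes."""
    if len(leaves) == 1:
        return leaves[0]
    k = 1
    while k * 2 < len(leaves):
        k *= 2
    return node_hash(tree_root(leaves[:k]), tree_root(leaves[k:]))

class Monitor:
    """Keeps a full copy of the log's leaves and checks each published root."""

    def __init__(self):
        self.leaves = []

    def observe(self, new_entries, claimed_root):
        # The claimed root must equal the root over our old leaves plus
        # exactly the new entries: nothing rewritten, nothing dropped.
        candidate = self.leaves + [leaf_hash(e) for e in new_entries]
        if tree_root(candidate) != claimed_root:
            return False  # log misbehaved, or we missed some entries
        self.leaves = candidate
        return True
```

A monitor that catches a mismatch here holds exactly the evidence mentioned above: its stored leaves plus the log's signed root demonstrate the inconsistency to anyone.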
And we assume that there's a keyring package that is an authoritative copy of the keyring. This keyring source package would then be respected by the archive on uploads. OK, so much for the archive. The log server is a standalone server. One would probably want to have more than one of these, and you can have read-only frontends to them, because only the archive submits to them. I think we will also have a much easier time operating these log servers compared to, say, certificate transparency, where usually anybody can upload certificates. The auditor component should be integrated into apt somehow. The important parts here are some cryptographic verifications, a few file formats that need to be understood, and some network access. Some of these things we can also distribute over the mirrors: pack them together into files such that they are usable for most clients, and then ship them off as additional metadata, much like we do with other metadata right now. OK, so much for the log server and the auditor; now for the monitor. The monitor tasks include verifying the append-only operation of the log. The monitor will probably want to exchange tree roots with other monitors and auditors. The verification functions of the monitor also include checks of the package metadata. For example: is the metadata complete? Are all the parts that are supposed to be there actually there? Do we have the source available for every package? Are the version increments correct? For each package change or dependency change, there should be a version change, such that clients actually update correctly. I already talked about the release frequency that needs to be consistent. We can verify the upload access control lists using source signatures and the authoritative keyring. And the monitor, of course, can also verify reproducible builds, to verify the mapping between source and binary package using the buildinfo file.
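A couple of the metadata checks just listed could look roughly like this. The dictionary shapes are hypothetical stand-ins for parsed Packages/Sources data, not the prototype's actual format, and the check list is far from complete.

```python
def check_snapshot(prev, curr, sources_in_log):
    """Compare two snapshots of binary package metadata and report problems
    a monitor would flag. Snapshots are assumed to be dicts of the form
    name -> {"version": ..., "sha256": ..., "source": ...} (hypothetical)."""
    problems = []
    for name, meta in curr.items():
        # Every binary must have its source package available in the log.
        if meta["source"] not in sources_in_log:
            problems.append(f"{name}: source '{meta['source']}' not in log")
        old = prev.get(name)
        if old is None:
            continue  # newly introduced package
        # Changed contents must come with a changed version, otherwise
        # clients holding the old metadata never pick up the new package.
        if old["sha256"] != meta["sha256"] and old["version"] == meta["version"]:
            problems.append(f"{name}: contents changed without a version bump")
    return problems
```

In practice such checks would run whenever the monitor receives a new batch of log entries, comparing consecutive metadata snapshots.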
OK, so those are the things that are new in this architecture. Now I'd like to discuss what useful features there are for different groups, aside from the new security properties. These should, in my opinion, also serve as motivation for different people to run one of these monitors, because the monitors can alert them if things go wrong in some way. Maintainers get assurance of reproducibly built binaries. They can get a notification of all uploads that were done using their key, and notifications of who made changes to their packages. All of this is independent of the existing infrastructure, so no matter what is compromised there, these properties would all still be verifiable. The account managers and keyring maintainers get a way to ensure that the keyring is respected for uploads. The reproducible builds people get assurances of reproducibly built binary packages. And the archive operators and FTP masters get assurance of correct metadata, of reproducibly building binary packages, and that their machines are correctly enforcing the upload ACLs. So these are the features added to the system. Now let's talk about what actually exists. It exists in prototype fashion: for all the components, there's a prototype. These prototypes contain the cryptographic verification functions and so on, but they are not yet in the form of patches to the actual Debian software, to apt, for example. To evaluate the prototype, we used the last two years of the stretch development phase, so from two years before stretch became stable until stretch became stable; basically two years of Debian testing. We fed it into the system. This results in 270,000 tree elements, and the log server needs about 400 gigabytes of storage, which is, of course, heavily dominated by the source packages.
The cost of the monitor is heavily dominated by the verification of reproducible builds, because the compiling takes time and CPU. We also noticed that there may be some inconsistencies in the metadata at some points in time, for example that the source is missing, or that the version wasn't incremented where I thought there would be a version increment. OK, if you are interested in this topic, here are some pointers that may also interest you. The first one is a theoretical work. The second one is another interesting software distribution security system; I believe there will be a presentation later in the conference. The third one is an idea by the Firefox people to have a similar feature for Firefox. And the fourth one is a new proposal. OK, let's quickly recap. Under this architecture, we can achieve the detection of targeted backdoors. We get lots of auditability features: for example, we can be sure that for every binary we can point to the appropriate source code, and that the source code is downloadable. For each binary, we can also be sure that this is the exact source code that was used to compile it. We can identify which maintainer authorized an upload. And in case anything goes wrong, the party detecting it should have strong evidence to convince the public that yes, there was a problem, and the problem was this and that. There's a class of attacks I didn't really discuss in detail today, so just be aware that there's more out there. So that would be it from my side, and now I'm interested in your questions, ideas, and any other feedback you might have.

Hi. You mentioned multiple times the append-only feature of these trees. Over decades of running this, when the whole system is growing rapidly, how do you propose to do rollovers of those hash trees? What are your ideas on that front? Do you have any proposals? I don't have a specific proposal.
Certificate transparency is ahead of us and they are dealing with this problem right now, basically. The main idea they seem to be using is having different log servers for different years. And well, that's one way to do it.

Someone on the internet asked: does that scheme do anything to prevent replay attacks? You mean, avoiding that I'm shown an old version of a package again? Yeah. OK, yeah. In the literature that's most of the time called a freeze attack, and apt already has a defense feature built in against that. The release files, which cover all the packages, have a wall-clock time until which they are valid. I think it's probably seven days or something like that for which a release file is valid.

What's the status of this proposal? Who is working on it? Who will operate these servers? And what do you plan to do against compromise of these servers? Because then you can root everybody. Yes. So the status is that I'm working on it, and everybody is invited to also work on it. Any work, for example on reproducible builds, is also helpful, of course, because that work is important and needs to be done. If the log server is compromised, that should be possible to detect when malicious operation is done; the exact details depend on what the attacker would do. And I think there was another part to your question. Who operates it? Yeah, of course. That question is important to me, because a lot of academic work says: we need this cool system, and independent parties will operate it. So I think that's a very important question. The log server itself would, in my imagination, be operated by the project, like other central infrastructure. The auditor component is integrated with the apt client, so everybody has it and can turn it on or off. And the monitor functions, these need to be run independently as well.
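The apt freeze-attack defense mentioned here can be illustrated with a small freshness check. The date format follows what Debian Release files use for their Valid-Until field, but treat the parsing details as an assumption of this sketch rather than apt's actual code.

```python
from datetime import datetime, timezone

def release_is_fresh(valid_until: str, now: datetime) -> bool:
    """Reject a Release file whose Valid-Until date has passed, so a
    compromised mirror cannot serve frozen, outdated metadata forever.
    Expects dates like 'Thu, 01 Aug 2019 00:00:00 UTC' (assumed format)."""
    # Drop the textual zone suffix and treat the timestamp as UTC.
    s = valid_until.replace(" UTC", "").replace(" GMT", "")
    expiry = datetime.strptime(s, "%a, %d %b %Y %H:%M:%S")
    expiry = expiry.replace(tzinfo=timezone.utc)
    return now <= expiry
```

A client applying this check simply refuses to use metadata older than the validity window, forcing the attacker to keep producing freshly signed (and logged) release files.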
That's one reason why I came up with the different verification functions: different people get different advantages or might be interested in different assurances. For example, the account manager team and the keyring team might want to run a monitor that only does the lightweight verification functions, in particular all the verification functions related to keys, and they might not run the full reproducible builds verification, which of course is costly.

Hello. You said there were some unexpected events. Could you tell a little bit more about that? Did you investigate the causes and so on? Yes, I plan to do so, but I need to improve my tooling, because I have tons of metadata files and navigating through them is a bit painful at the moment. But it's work that needs to be done. And also, have you run the same set of checks against buster yet? No, not yet. OK, thank you.

That means you have code for this already? I do have. Do you have code for this already? Yes, prototype code. The verification functions are implemented, most of them at least, but I don't have patches for apt, for example. At the moment, it's an independent component that does the verification. And where is the code? I mean, the URL? It's in Git; I can put up the URL here.

Hello. You said that testing reproducibility of the builds is very costly. Does that mean you're doing that already? No. OK, I was going to ask how you're doing that. Yes, I think that's the only thing that's missing, I think, yeah.

How does the integration with apt work exactly? That's not decided yet. Do you have any idea how it would work? Would apt, for instance, contact the log server directly? Ah, OK, yeah. Yes, there are multiple possibilities to do that.
For example, to ship the proofs to the clients, we could decide to have the proofs that are likely to be relevant to most clients distributed over the mirror network, to save the clients contacting the log server separately; or one could have the client always contact the log server. So there are different trade-offs possible, and we would then need to decide which way to go.

I'd like to thank you for your work, because in reproducible builds we've been discussing these ideas since 2014, but never wrote them down this consistently. So thank you for that work. Great. Yeah, the same.

You mentioned apt doing some kind of online querying to the log server for the pieces necessary to verify that a package can be trusted. But a lot of the time apt could be used offline as well, and the package indices contain signatures that allow offline verification before installing a package if it has already been downloaded. At install time, for example, packages could be coming from offline media. Do you have ideas how offline verification could be done? Is it possible to distribute, along with the .deb, proofs that the package is trustworthy? Yes, it's certainly possible. If we decide to ship proofs over the mirror network, we have to consider, for example, the consistency proofs between different tree heads, meaning different tree sizes. These proofs aren't large, they are quite small, so we would need to pack a bunch of generations, basically, into one file and ship it off. If the client is within range of those tree sizes, then it would work; otherwise, an online operation would be required. But I haven't thought a lot about the installation phase, I have to admit.

OK, there would be a bit more time if there are any questions. OK.