Thank you. As this is an academic talk, as announced, I will try to bring some of the research I did during my PhD into the real world. We are going to talk about the security of software distribution, and I'm going to propose a security feature that adds on top of the signatures we have in apt today, and also on top of the reproducible builds that we already have to a very large degree. I am going to highlight a few points where I think infrastructure changes are required to accommodate this system, and I would also appreciate any feedback you might have.

I'm going to start off with a few motivational words: why should we care about the security of software distribution? We already do have cryptographic signatures. I have put up a few examples of recent attacks that involved the distribution of software, where people who presumably thought they knew what they were doing had grave problems with software distribution. For example, the Juniper backdoors are pretty famous: Juniper discovered two backdoors in their code, and nobody really knew where they were coming from. Another example would be the Chrome extension developers who got their credentials phished and subsequently their extensions backdoored. Or, as yet another example, an unsigned update to banking software actually included malware and infected several banks. So I hope this is motivation for us to consider these kinds of attacks to be possible, and to prepare ourselves.

I have two main goals in the system I'm going to propose. The first is to relax trust in the archive. In particular, what I want to achieve is a level of security even if the archive is compromised, and the specific thing I am going to do is to detect targeted backdoors, that means backdoors that are distributed only to a subset of the population. What we can achieve is to force the attacker to deliver the malware to everybody, thereby greatly decreasing their degree of stealth and increasing their danger of detection, so this would work to
our advantage.

The second goal is forensic auditability, which overlaps to a surprising degree with the first one in terms of implementation. What I want to ensure is that we have inspectable source code for every binary. We do of course have the source code available for our packages, but only for the most recent version; everything else is a best effort by the code archiving services. The mapping between source and binary can be verified, to a large extent, once we have reproducible builds. I want to make sure that we can identify the maintainer responsible for the distribution of a particular package, and the system is also interested in providing attribution of where something went wrong, so that we are not in a situation where we notice something went wrong but don't really know where we have to look in order to find the problem; instead, we have a specific and secured indication of where a compromise or problem was coming from.

Let's quickly recap how our software distribution works. We have the maintainers, who upload code to the archive. The archive has access to a signing key which signs the release, actually metadata covering all the actual binary packages. These are then distributed over the mirror network, from where the apt clients will download the package metadata, that means the hash sums for the packages, their dependencies and so on, as well as the actual packages themselves. This centralized architecture has an important advantage: the mirror network need not be trusted, because we have the signature that covers all the contents of binary and source packages and the metadata. On the other hand, it makes the archive and its signing key a very interesting target for attackers, because this central point controls all the signing operations. So this is a place where we need to be particularly careful, and perhaps even do better than cryptographic
signatures. This is where the main focus of this talk will be, although I will also consider the uploaders to some extent.

So we want to achieve two things: resistance against a key compromise and targeted backdoors, and better support for auditing in case things go wrong. The approach we choose is to make sure that everybody runs exactly the same software, or at least the parts of it they choose to install. If we think about that for a moment, this gives us a number of advantages. For example, all the analysis that is done on a piece of software immediately carries over to all other users of the software, because if we haven't made sure that everybody installs the same software, they might not have exactly the same version, and perhaps some have a backdoored version. This also ensures that we cannot suffer targeted backdoors, thereby increasing the detection risk for attackers. And we will also want a cryptographic proof of where something went wrong.

Now, to look at some pictures, I will present the data structure that we use in order to achieve these goals. The data structure is a hash tree, a Merkle tree, which is a data structure that operates over a list. We have a list of these squares here, which represent the list items; in our case these are going to be files containing package metadata, such as dependencies and hash sums of packages, and the source packages themselves are also going to be elements in this list. The tree works as follows: it uses a cryptographic hash function, which is a collision-resistant compressing function, and the labels of the inner nodes in the tree are computed as the hashes of their children. Once we have computed the root hash, the root label, we have fixed all the elements, and none of the elements can be changed without changing the root hash. We can exploit this in order to efficiently prove the two following properties for elements. First of all, we can efficiently prove the inclusion of a
given element in the list, if we know the tree root ex ante. This works as follows; let's make a quick example. We see the third list item is marked with an X, and if I know the tree root, then the server operating the tree structure will only need to give me the three grey-marked labels, the three grey-marked node values, and then I can recompute the root hash and be convinced that this element actually was contained in the list. The second property is that we can also efficiently verify the append-only operation of the list. So we can have a log server operating this kind of structure, and the log server need not be trusted. It's not going to be a trusted third party; rather, its operation can be verified from the outside.

So what does this design look like? The theoretical foundation is called a transparency overlay. In our system it looks like this: we have the archive as per usual, we add a log server, and the archive will submit package metadata (the Release file, the Packages file containing dependencies and so on, and the source code) into this log server. The client, meaning the apt client, will be augmented with an auditor component, and this auditor component is responsible for verifying the correct log operation as well as the inclusion of the downloaded release in the log. This is the mechanism with which we will be able to make sure that everybody is running the exact same version of the software they installed. A third component is the monitor, which is also necessary to verify log operation, and additionally to inspect the elements that are contained in the log. The monitor would then be run by groups or individuals that want to make sure of certain properties in the log.

All right, let's quickly recap. We have added this log server, which can prove two properties efficiently to the outside world, and we have the auditor and monitor components, where the auditor is added to the apt client and the monitor does
additional investigative tasks. Now, in order to make this system work, I need to make a few assumptions. The archive will need to handle log submission and the distribution of certain log data structures; these are usually very small things that are given to the archive in response to a submission. Then, I'm assuming a very consistent release frequency. The archive is responsible for distributing reproducible binaries in my architecture. I'm assuming that the buildinfo files are covered by the Release file; I treat them as additional source metadata, so whenever the source package or the buildinfo file changes, I expect an increase in the binary version number. I also assume source-only uploads, and one additional thing: that we have a keyring source package that is treated by the archive as authoritative, and this keyring must have the special property that it is operated append-only, so that we can go back in time and see what keys were authorized at different points in time.

The log server is a standalone server component that speaks, at the moment, an HTTP-based protocol. Probably one would want to have more than one, but I think we are going to have a much easier time running log servers than, for example, the Certificate Transparency people, because we only have one source of write access, namely the archive. So we can easily schedule the write access, and we can have read-only front ends that aren't quite as critical. The auditor component would need to be integrated into the apt client or library; it needs to do things like cryptographic verifications, understand a few more file formats, and do some more network access. Parts of the proofs we could probably also distribute over the mirror network, and we need not necessarily do everything live in communication with the log server.

Okay, so this covers the archive, the auditor and the log server. The monitoring servers have a few functions that are necessary for
verification of the log itself, meaning they verify the append-only operation of the log, and they will also likely want to exchange tree roots with other monitors and some auditors. The important verification functions here are validating the metadata of the Release, Packages and Sources files, namely making sure that these are complete, that sources are available, that versions are incremented correctly, and so on. That's necessary to make sure that a compromised archive can't perform certain attacks. Also in this category is the fact that we depend on a fixed release frequency. Monitors will also be verifying the upload ACL, meaning which keys are authorized to upload, and monitors would also be verifying reproducible builds in this scenario.

Okay, so those are the monitoring functions, and I think that many different people and groups in Debian could get some benefit out of these monitoring functions in order to verify that everything worked correctly. We should note that all these verifications are completely independent of the existing infrastructure, because they're happening on the client side. We don't depend on any notifications from the existing infrastructure, and no notifications can be suppressed; this can be done completely on the client side using the data provided by the log server. So, for example, maintainers could verify that the code they uploaded builds reproducibly, using the corresponding buildinfo, or they could check which uploads were done using their key, and which of their packages were modified, perhaps by other people. The keyring maintainers or account managers could be looking at the keyring: what keys are in the keyring, and what uploads were done using which keys?
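One of these monitor checks, that versions are incremented correctly whenever the source changes, can be sketched roughly as follows. This is a simplified illustration with made-up names (`check_version_increments` and its tuple-based versions are my own invention, not the prototype's code); a real monitor would compare proper Debian version strings, e.g. with python-debian's `Version` class.

```python
def check_version_increments(snapshots):
    """Monitor-side check over successive log snapshots of one source
    package: whenever the source (or buildinfo) hash changes, the
    version must strictly increase.

    `snapshots` is a list of (version, source_hash) pairs in log order.
    Versions are plain tuples here purely for illustration.
    """
    problems = []
    for (v_old, h_old), (v_new, h_new) in zip(snapshots, snapshots[1:]):
        if h_new != h_old and not v_new > v_old:
            # source changed but the version did not go up: flag it
            problems.append((v_old, v_new))
    return problems
```

A monitor would run such a check over the whole history it mirrors from the log, independently of the archive.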
And the archive, last but not least, has an additional verification step available, to make sure all the metadata was produced correctly and no weird things happened during the production of a given release.

So, does this thing actually exist? Well, I have programmed prototypes for all these components. Nothing that would be ready to deploy, but enough to show that it actually works. I have used two years of Debian stretch releases and fed them into the system. This resulted in a tree size of 270,000 elements, and the storage required was about 400 gigabytes; almost all of that is the source packages. So I would say that it's eminently feasible to do this. The monitor functions run rather cheaply, and a monitor need not necessarily keep a complete copy of the log in all cases. But what I noticed were some unexpected events in the package metadata: I have observed sources missing, and version increments missing where I think there should be a version increment. So I'll be looking more closely into these cases.

If anybody is interested in the theoretical side of this, these would be the immediate pointers I can give. The first paper lays the theoretical and mathematical foundation, and the other ones are applications of similar transparency work, but with different goals.

Okay, summarizing: we can introduce a system to detect targeted backdoors even under a compromise of the archive. We need to add a bit more infrastructure, and need to change how some things are done. We can also improve auditability, in that we can securely identify where things went wrong. In particular, we can make sure that for every binary we can get the source code that was used to produce it, and then identify the responsible maintainer. There's one class of attacks I have left out for today.
If anybody wants to talk about that, we can do so too. And now I'm interested in your questions and feedback.

Q: Have you already tested the reproducibility, and how do you deal with packages that are not reproducible? Do you not integrate them into the log?

A: For now, the implementation of my monitor functions hasn't covered reproducibility. I think the first step would be to have a blacklist of packages that are known not to build reproducibly, and then try to get on from there.

Q: Two questions. You say "authenticating metadata and code"; does this mean signing, or what is it exactly?

A: Authenticating at which point?

Q: Where the tree is. The slide before that, yes.

A: Okay, this authentication here doesn't quite mean a signature. It means: if I know the value of the root of the hash tree, then I can be assured that a given element is included, if I am told the values of the three grey-marked inner nodes here, and that works by recomputing the hash tree. Okay, I think I have to defer the details until after the talk; I can explain it.
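The recomputation deferred in this answer (rebuilding the root hash from one element plus the grey-marked sibling labels) can be sketched in code. This is a hedged illustration, not the prototype's actual implementation: all names (`root`, `audit_path`, `verify_inclusion`) are my own, and the leaf/node domain separation follows the Certificate Transparency style.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def leaf_hash(item: bytes) -> bytes:
    return h(b"\x00" + item)          # leaf domain

def node_hash(left: bytes, right: bytes) -> bytes:
    return h(b"\x01" + left + right)  # inner-node domain

def split(n: int) -> int:
    """Largest power of two strictly less than n."""
    k = 1
    while 2 * k < n:
        k *= 2
    return k

def root(items):
    """Root label of the hash tree over the list of items."""
    if len(items) == 1:
        return leaf_hash(items[0])
    k = split(len(items))
    return node_hash(root(items[:k]), root(items[k:]))

def audit_path(index, items):
    """The sibling labels the log server hands out (the 'grey nodes'),
    ordered leaf-to-root."""
    if len(items) == 1:
        return []
    k = split(len(items))
    if index < k:
        return audit_path(index, items[:k]) + [root(items[k:])]
    return audit_path(index - k, items[k:]) + [root(items[:k])]

def verify_inclusion(item, index, size, path, expected_root):
    """Client side: recompute the root from the item and path alone."""
    def walk(index, size, path, node):
        if size == 1:
            return node
        k = split(size)
        if index < k:
            return node_hash(walk(index, k, path[:-1], node), path[-1])
        return node_hash(path[-1], walk(index - k, size - k, path[:-1], node))
    return walk(index, size, path, leaf_hash(item)) == expected_root
```

Note that the verifier only needs the item, its position, the tree size and the path: it never sees the other list elements, which is why the proof is small.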
Q: Another question: the detection of targeted backdoors, do you mean this at the stage of signing the archive? Which backdoors?

A: The scenario would be that the signing key of the archive is used to create an additional Release file which covers a manipulated software version, and this software version and signature are only shown to the victim population, not to the general population. This means that the malicious software will only be observed by the victims, and not by everybody else. My goal is to force the attacker to distribute the malicious software to the whole world, in order to increase the chance that they are going to be detected, and thereby perhaps deter the attack from the beginning.

Q: Great talk, and great ideas as well. I really like your slide on your assumptions; you're really honest about them, like, yes, we assume all these things. I wouldn't underestimate how difficult it would be to make some of these changes, I mean, even ones that look simple, like source-only uploads: everyone wants them, right, but...

A: Yes, sure. We have to start somewhere, and I hope that if people are convinced that this is a great idea and that we should do it, then maybe that gives some more impetus for these things that everybody wants, like source-only uploads. Thank you.

So, I'm interested in any kind of feedback: if you think this is a great idea, or you think there are problems I might have missed, or that it might get difficult to implement, please come and talk to me.
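The targeted-backdoor scenario described in that answer is defeated by clients comparing what they each see of the log. A minimal sketch of that comparison, with hypothetical names not taken from the prototype: two observers exchange tree heads, and an identical tree size with different root hashes is cryptographic evidence that the log equivocated, i.e. served different views to different victims.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Root label of the hash tree over the log entries; leaf and
    inner-node prefixes keep the two hash domains separate."""
    level = [h(b"\x00" + x) for x in leaves]
    while len(level) > 1:
        nxt = [h(b"\x01" + level[i] + level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd node is promoted unchanged
        level = nxt
    return level[0]

def split_view_evidence(head_a, head_b):
    """Gossip check between two clients: same tree size but different
    roots proves the log showed them different lists."""
    (size_a, root_a), (size_b, root_b) = head_a, head_b
    return size_a == size_b and root_a != root_b
```

In the real system the heads would additionally be signed by the log server, so a detected split view is attributable evidence rather than hearsay.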