Hello, everyone. Welcome. My name is Jussi Kukkonen. I'm here today with this impressively vague talk title, and I promise that while the title is broad, the talk really isn't. I'm going to talk about a specific security feature, an implementation of The Update Framework (TUF), for a specific package repository, the Python Package Index (PyPI), and we'll see how much we can learn from that example. I believe there are generic takeaways from this specific example, but I'm not going to try to unwrap them too much in this short talk. I'll just tell a short story about this one security feature: how it has succeeded and, maybe more interestingly, how it hasn't succeeded in improving package repository security. Basically, what we can learn and what approaches we could take in the future.

The package repositories I mean in the title are specifically community-maintained language ecosystem repositories. The example is PyPI, but I think the lessons are applicable to any of the others that fit that description. I would be cautious about generalizing to other things like, say, Linux distro repositories, because they look similar but have small but important differences, at least for this specific case. My background is that I've worked on open source and supply chain related things for a while now, and I'm one of the maintainers of python-tuf, which forms a part of the security feature I'm going to cover.

My plan is to explain the problem that's being solved, or that we're trying to solve, show you the potential solution, and then analyze how this solution has been applied to PyPI, the Python ecosystem repository. After that I'll look at what worked and what didn't, and at the lessons that could be useful.

So I should start by defining the problem that we, or people before me, started to solve. At a very high level, it's just that community package repositories like this are not safe. PyPI and other repositories like it were designed in a more innocent time, when supply chain security wasn't really a thing people even talked about, not in this context. And you could say that these repositories are successful partially because of that insecurity: part of the appeal is that the repository is a sandbox where anyone can publish anything, with no expensive publishing processes involved.

The specific issue this story is about is that the repository itself, the service, is a massive single point of failure. What I mean is that if the repository infrastructure, like the pypi.org website, is compromised, then any clients like pip that download anything will be compromised. And there is a separate issue that I think is very much linked: even if the developers of those original packages try to do better and, let's say, sign their packages, no one later in the supply chain does anything with those signatures.

Now you might think this is a fairly simple problem: we just use more signatures. Make download clients verify signatures made by either the repository or developers or both. And yeah, that is kind of true. Cryptographic signatures are the obvious solution to problems like this. They provide resilience against infrastructure attacks and are proven to work; distros like Debian have this working, and so do many other software delivery mechanisms.
But there are differences between those systems and these language ecosystem repositories that mean the language repositories can't easily copy the solutions from those other systems, at least not without radically changing the free-for-all nature of the repository that I mentioned.

There are a couple of things you need to make signing viable and useful. You need to be able to delegate trust somehow: to define who is trusted to do what, which keys are allowed to do what. And then you need some sort of key management, and I don't just mean how private keys are protected, but how the public keys are propagated: how do the downloaders know which keys to trust at a given time? These issues have to be solved even for repository signatures, where the repository signs things on its own, but they really become a major pain point if you have the developers sign things.

What makes repositories like PyPI uniquely tricky here is just the scale and velocity of things. If you think about the number of maintainers, the number of signers that would exist in a hypothetical PyPI where developers signed, that is an amazing number of keys and identities to keep track of, and it easily becomes a key management and key propagation nightmare. And if we think about methods that other systems like this use, such as maintainer identity verification, trying to keep track of who people really are and whether we can trust them, that is also practically impossible if you have hundreds of thousands of signers. The frequency of releases is another thing: if the repository signs things whenever they are made public, you could do that fairly securely if you make releases once a week or so, which is how a stable Linux distribution might work. But when your releases happen every few seconds, the signing key has to be online and available to processes that can be compromised. So it's not the easiest case.

But we do have a potential solution to at least some of these issues, and that is TUF, The Update Framework. It's a secure content delivery framework, and it's what I've been working on for the past two years or a bit more. It was originally developed around 2009 at the University of Washington: Justin Samuel and Justin Cappos wrote a paper with a couple of Tor developers, "Survivable Key Compromise in Software Update Systems", and that led to a specification and a reference implementation. TUF provides a solution to trust delegation and to propagating public keys to downloaders. You can think of it as a clever way of allowing quite a bit of complexity in delegating trust and doing key changes without making any of that a user problem, or even a downloader client problem. And in doing that, it also provides this interesting feature of making the repository fairly compromise-resilient: you cannot as easily compromise the downloader clients by compromising the repository if you're using this.

I'll need to explain a bit about how TUF works, sorry about that; I promise to get back to the package repositories in just a moment. A TUF implementation typically means a client library, used by a client application that wants to download something, and then some sort of repository software that maintains metadata that the client uses to verify any content found in the repository. For clients, this is pretty easy: they use a TUF library much like an HTTP library. You just tell it which artifact you would like to download, and the TUF client downloads it and makes sure the content is correct.
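To make that concrete, a download with the python-tuf client library looks roughly like this. This is a minimal sketch: the URLs and directory paths are made up, and it assumes the client has already been bootstrapped with a trusted root.json in its metadata directory.

```python
from tuf.ngclient import Updater

# The metadata directory must already contain a trusted root.json,
# obtained out of band (for example, shipped with the application).
updater = Updater(
    metadata_dir="./metadata",
    metadata_base_url="https://repo.example.com/metadata/",
    target_dir="./downloads",
    target_base_url="https://repo.example.com/targets/",
)

# Refresh the top-level metadata (root, timestamp, snapshot, targets).
updater.refresh()

# Look the artifact up in the (possibly delegated) metadata...
info = updater.get_targetinfo("x.zip")
if info is None:
    raise RuntimeError("x.zip is not listed in the repository metadata")

# ...and download it; the library verifies length and hashes for us.
path = updater.download_target(info)
print(f"Verified download at {path}")
```

Everything else, all the metadata fetching and signature checking, happens inside those few calls.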
Under the hood, the client does this by first downloading some TUF metadata: JSON files that the repository makes available, which together contain essentially a chain of signatures proving that the artifact is not compromised. The repository, in turn, maintains that metadata.

The metadata is a set of JSON files that each define what's called a role. There's quite a bit more complexity in the format, but the main point is that each role can do two things. First, it can list artifacts it knows about; this is the information the clients are actually looking for. By artifact I just mean a download URL and a content hash for that URL, so that you can download the file and know that it's correct. Second, it can delegate trust further, to other roles. And this delegation can be focused: we might delegate some artifacts to one role and other artifacts to some other role.

So here's a simple example repository: a diagram of delegation starting from the root role, which delegates to the targets role, which delegates to two more roles that then actually list one artifact each. The actual TUF architecture contains a few more mandatory roles, but I think these are the ones relevant to understanding the trust delegation and key propagation.

The delegating role decides both whether it delegates trust onwards and which keys the delegated role must be signed with. What this means is that the root role essentially operates as a source of trust that then propagates onwards, so your client can be bootstrapped with just the root keys, and that's it; the rest of the trust tree can be fetched dynamically when you need it. Root, targets, and a few other roles not drawn here are hard-coded in the spec, but everything to the right of targets is up to the specific repository, and it can change dynamically as needed: new metadata versions might add or remove artifacts, but could also add new delegations and whole new delegation trees.

To make that example a bit more concrete: if we have this repository and a downloader client wants to download x.zip, there's a defined process where the TUF client library first makes sure that the top-level roles (root, targets, and a few others not drawn here) are up to date and correctly signed. Then the client sees that targets delegates onwards to project-x, so it goes ahead and downloads the project-x metadata and verifies that it's signed with keys defined in the delegating role. There it finds a matching artifact entry with a hash, so it can download the artifact and check that the hash is correct.

So to compromise this download process, the attacker needs access to the root, targets, or project-x keys. Getting control of just the hosting infrastructure for x.zip, or for the metadata, is not enough: if you want to compromise clients, you need access to keys.
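To give a feel for what those roles actually contain, here is a heavily simplified sketch of the two metadata files from the example, written out as Python dictionaries. The real format has more mandatory fields, and the key IDs, hashes, and signatures here are shortened placeholders.

```python
# Simplified sketch of the "targets" role: it lists no artifacts itself,
# but delegates paths matching "x*" to the "project-x" role, pinning
# the public key(s) that project-x metadata must be signed with.
targets_role = {
    "signed": {
        "_type": "targets",
        "version": 4,
        "expires": "2024-01-01T00:00:00Z",
        "targets": {},
        "delegations": {
            "keys": {"f2d5...": {"keytype": "ed25519", "keyval": {"public": "..."}}},
            "roles": [
                {"name": "project-x", "keyids": ["f2d5..."], "threshold": 1, "paths": ["x*"]},
            ],
        },
    },
    "signatures": [{"keyid": "ab04...", "sig": "..."}],
}

# The delegated "project-x" role actually lists the artifact: a target
# path (which the client turns into a download URL) plus its length
# and content hashes.
project_x_role = {
    "signed": {
        "_type": "targets",
        "version": 2,
        "expires": "2024-01-01T00:00:00Z",
        "targets": {
            "x.zip": {"length": 1024, "hashes": {"sha256": "..."}},
        },
    },
    "signatures": [{"keyid": "f2d5...", "sig": "..."}],
}
```

The thing to notice is that project-x's keys are pinned one level up, in the delegating role, so the client can walk the chain downwards checking each signature against keys it already trusts.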
Yeah, that was a quick primer on TUF. I hope that's enough about it, and we can move on to the actual content: the plans for integrating this into the Python ecosystem. We've got two PEPs; PEP is short for Python Enhancement Proposal. The first one, PEP 458, was written in 2013, and it defines a repository signing model. That means the maintainers of the projects do not sign things; instead, every time a new artifact is uploaded by a maintainer, the PyPI software adds that release artifact to the relevant metadata and then signs it.

A system like this protects against file hosting compromise, for both the artifacts and the Python index files, which would be the signed content here. But it wouldn't protect against compromise of the signing component itself, of course, because as we discussed, the key has to be available all the time, since there are new releases all the time. So the key is online. It is compromise-resilient up to a point, and it also handles key propagation and enables compromise recovery: if keys are compromised, they're easy to replace within the system.
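As a sketch of what that repository-side signing step could look like with python-tuf's metadata API: the file paths here are hypothetical, a real deployment would keep the online key in a KMS or HSM rather than generating a throwaway key like this, and it would also update the snapshot and timestamp roles, which I've left out.

```python
from securesystemslib.signer import CryptoSigner
from tuf.api.metadata import Metadata, TargetFile, Targets

# Load the repository's current targets metadata (hypothetical path).
targets = Metadata[Targets].from_file("metadata/targets.json")

# A maintainer just uploaded dist/x.zip: record its length and hashes
# under the target path that clients will look up.
targets.signed.targets["x.zip"] = TargetFile.from_file("x.zip", "dist/x.zip")
targets.signed.version += 1

# Re-sign with the repository's online key. A throwaway key stands in
# here; in production the signer would wrap the real online key.
signer = CryptoSigner.generate_ed25519()
targets.sign(signer)
targets.to_file("metadata/targets.json")

# A real repository would now also update and re-sign snapshot and
# timestamp so that clients accept this new metadata version.
```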
There is another proposal that's a bit more interesting. Just a year later, in 2014, PEP 480 was written, and it defines developer signing: each project potentially signing its own releases. It does contain some major workflow changes for maintainers and administrators, meaning maintainers of Python projects and administrators of PyPI. But it protects against all infrastructure compromise and creates a secure supply chain from developer all the way to user. What I mean is: if you remember the diagram of roles, imagine it a bit more complex and a bit deeper, and then the whole chain of keys, from root to targets to some midway delegated roles to the final project role, can be offline keys. They could even be hardware keys in someone's drawer. That means it is not possible to compromise any downloader clients without access to at least one of those keys, and none of those keys are on the server. I don't know about you, but I think this is just a very cool idea, and it would make the whole supply chain resilient to PyPI compromise. Period.

I was not involved in the project at that time, but I would assume the thinking was that there was now a simple two-step process: we implement PEP 458, then we implement PEP 480, and the result is like a Fort Knox of package repositories.

But before we get to what has actually happened in this space, I just want to quickly mention how TUF in general progressed from that point, around 2014, to today. I think it's fair to say that the specification has stood the test of time. It has improved, but in general it has proven to be useful in practice and secure. There are many more implementations now; it's in use in several operating systems as an update system and in other places as well. Practically useful and clearly implementable.

What about the PyPI case I was talking about? We didn't get a Fort Knox; we maybe got the Sagrada Familia instead. About nine years have passed since the first proposal, and we don't have anything running publicly at this point. This is what I want to spend the rest of the talk on: why was this so much harder than expected, and what could we learn from it? I don't want to make it seem like there is no success at all. The PEPs have resulted in a lot of low-level work on the spec and on the python-tuf implementation, which is a far better software project now than it used to be. PEP 458, the first one, has had considerable progress, very smart people are working on it right now, and I trust that it will happen, hopefully soon. But the reason I wanted to talk about this is that nine or ten years from making the proposal to maybe getting it into production is a long time, and that's the simpler proposal of the two. Much, much simpler. If you ask me, as something of a subject matter expert, I would say that on this same trajectory PEP 480 is not going to happen; it's just a far more difficult problem. So I would like to see if there are other trajectories available to us.

Right, so what can we learn? I have some suggestions. There are a couple very specifically about TUF, how the spec works and how the software is developed; then some comments on the enhancement proposals themselves; and then some more general suggestions on what we could do in practice to get things like this moving.

One clear issue has been that while implementing simple repositories with TUF is not too bad, it turned out to be far more difficult for complex and highly automated cases. I think there is a clear underlying reason for that: the spec only defines what the client does. The repository isn't any simpler to implement, but the complexity isn't obvious to the spec reader, because the spec only talks about the client. Maybe this sounds obvious now: if you see a spec that only talks about the client, you'd notice that it's not the whole story. But if you write specs for simple engineers like me, you really should have big letters saying "this is just half the story; when you've implemented this, you're only about 50% done." It took me an embarrassingly long time to figure this out, so maybe it should have been said.

If you're wondering why that is a core decision in the TUF spec, it's that the spec is supposed to be generic. It's not a specific update system; it's a framework for building update systems. That works great for the client, because the client is essentially the same for everything, and it kind of works: you can build wildly different repositories with TUF. As an example, you could have a repo maintained by a single person with no server software running at all, just something you run on your local machine, publishing the files every few weeks or months. Or you could have a repo with thousands of maintainers and a lot of automation on the server side. The same workflows aren't going to work for both. So I can see why the spec doesn't define any repository workflows, but something really should. The result of the specification's hyper-focus on the client has been that we simple implementation developers spend a lot of time very, very carefully implementing the client spec down to the finest detail, and then when it comes to the repository tools, we throw something at the wall: we don't carefully define those repository operations either, and it probably doesn't come out very well. We probably should have chosen one specific type of repository, or maybe a few, and provided good support for those specific use cases. But that's not how most, or almost all, of the TUF implementations I know of operate.

So yeah, that was about TUF. Let's look at the PEPs. What could the PEPs have done better? These are the enhancement proposals for the Python ecosystem, and while attacks and protections and TUF mechanisms are important, they maybe didn't make the best possible enhancement proposal for the Python ecosystem, because they don't talk so much about PyPI workflows and processes, or how this affects the clients or the developer tooling and so on. And the PyPI developers themselves have not been super involved in this process. I can understand that.
Spending time on a proposal that mostly doesn't talk about the things you're interested in maybe doesn't sound very appealing. And while the PEPs have a lot of focus on TUF mechanisms, which are covered in great detail, they don't actually include a software design or prototypes. I don't mean PyPI prototypes, that's a lot to ask, but TUF repository prototypes that would validate the design being proposed. Because something like PEP 480 has not been implemented anywhere, and PEP 480 is a very ambitious design; it will require prototyping somewhere. And if you think about PyPI, that is a very conservative project, for good reasons: it's the backbone of a massive number of supply chains, and prototyping there does not seem like a good idea.

Finally, developer signing especially, so PEP 480, would absolutely disrupt PyPI workflows at many levels, and this disruption really needs to be well defined before a proposal like this is credible. I probably shouldn't go into too much detail, but if you're wondering what workflow changes I mean, think about identity or key changes. Right now, if you're a Python project maintainer and you want to add a new maintainer, you go to the website, you click a button and write the username of the new maintainer, and that's pretty much it; that is the trust delegation currently. But if all of that is locally signed information, then you have to start with, say, the new developer signing a message that contains their public key. That has to be made available to the original developer, who can then change the trust delegation so that it includes that public key, and that change has to be made part of the repository, something like the sketch below. Now we're talking about multiple local tools required in a workflow that used to be almost a single click on a website. So that's a big change, and changing the user experience like that for tens of thousands or hundreds of thousands of people is a pretty big thing.
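Here is a rough sketch of that delegation change using python-tuf's metadata API. Everything in it is hypothetical: the file and role names are made up, throwaway keys stand in for the maintainers' real offline keys, and a real workflow would also need tooling to exchange the keys and publish the resulting metadata.

```python
from securesystemslib.signer import CryptoSigner
from tuf.api.metadata import Metadata, Targets

# Step 1, new maintainer, locally: generate a key pair and send the
# public half to an existing maintainer out of band. (Throwaway key
# here; in reality this could be a hardware-backed offline key.)
new_key = CryptoSigner.generate_ed25519().public_key

# Step 2, existing maintainer, locally: add the new public key to the
# delegation for the hypothetical "project-x" role, in the metadata
# that delegates to that role.
delegating = Metadata[Targets].from_file("metadata/targets.json")
delegating.signed.add_key(new_key, "project-x")
delegating.signed.version += 1

# Step 3: re-sign with the existing maintainer's own key (again a
# stand-in key here) and publish the new metadata version.
existing_signer = CryptoSigner.generate_ed25519()
delegating.sign(existing_signer)
delegating.to_file("metadata/targets.json")
```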
I think that currently we don't have a reliable view of the required level of disruption, and that was just one example. In my opinion, that means we can't really tell whether PEP 480 is a viable solution for PyPI, and I think we really should be able to tell if we have an enhancement proposal for it. That sounds a bit harsh, and I want to stress that I still think this is a very, very cool idea and worth investment and work. But we do need some kind of different approach if we want to figure this out.

Okay, so finally, looking at the wider ecosystem and stepping back to the original problem statement: the original design of community repositories is from a more innocent time, and the evidence we've seen is that some solutions just need more experimentation, and maybe more resources, than individual projects like PyPI are willing or able to expend. It may not be sustainable for each repository and ecosystem to experiment with and develop all of these solutions on their own. But the repositories are more alike than they are different, and a lot of the problems are shared: the case I've described is a good example of a shared issue, and what I've talked about as a potential solution is a potential solution for a lot of different repositories. So it would make sense to combine forces somehow, and the number one forum for that currently seems to be the Securing Software Repositories working group. There seems to be real momentum there, so if anyone has ideas in this sort of space, I would suggest that as the place to do it.

And then there's a more specific place to experiment. I've got a project that I'd like to continue that is basically about prototyping exactly this developer signing for repositories like the ones we've talked about. The fact that community repositories are conservative and lack resources means we need places where we can experiment with specific designs, and it probably doesn't make sense to do that in those repositories themselves. For the PEP 480 use case, and I'm using that just to describe this general design of developer signing, my plan is to use Repository Playground; the link's there, and if you're interested, please have a look and get in touch. It's still in fairly early stages, but the plan is to develop TUF repository workflows and processes for this community package repository use case and actually try what works and what does not. The goals are basically to find out if TUF is a viable solution, and then to produce an implementation and documentation that allow repository projects like PyPI to evaluate the idea and see if it can work for them: can they take the design and use it in their own systems?

Yeah, that is what I have. I think we should have some time for questions or comments, and I think I should put this on if there are any. All right, go ahead.

It's quite an interesting problem, right? I was wondering, because you talked a bit about TUF and the PEPs and PyPI and the working group in the OpenSSF and so on, it's all kind of different parties who are maybe not collaborating enough. Is that impression true? Because I would think that the security of the PyPI repository would be one of the prime concerns of the people who are running PyPI and developing it, so is there some gap there, or can you share some thoughts on that?

Yeah, I suppose that's a really common problem in open source, right? I don't have any fantastic solutions for it, but it's really understandable. Projects like PyPI aren't exactly the sort of sexiest projects that get a lot of new developers all the time, and they also can't just let anyone make changes to the PyPI source code just like that; they have to be careful about it. So there isn't a huge amount of resources available to just start looking at experimental projects. I'm not involved in developing PyPI itself, but I can certainly understand that, seeing far-reaching proposals for it, you might look at them and decide that, well, I've got these much more pressing concerns to work on right now. It's a tricky problem, but working together probably is part of the solution.

Sorry. With so many developers and so many keys, there are going to be compromises somewhere. Is there a revocation mechanism built into the framework?

Yeah. It's not even revocation in a sense, but basically, when you try to download something, the whole trust chain is updated, if that makes sense. So you always have the newest keys that are valid for that artifact. So yes, that is something TUF handles.

Hi, my name is Nick. I was wondering, since you mentioned that the PyPI repository system was designed in times when security was not really a big concern, it might have some security issues, and it had them at the time the PEPs were proposed, right?
I'm wondering, since none of those PEPs have been implemented, what are the security measures that are taken today, and that were taken during those nine years, apart from your PEPs, that enhance security? And the second part of the question: if it has worked for nine years, hopefully successfully, why couldn't it be used already?

So, on the generic question of PyPI security: they've definitely worked on various aspects of it, but like I said, I'm not a PyPI developer, so I won't go into details. They've definitely done things, and I think there are measures like two-factor authentication and so on in use. And on the second part: TUF has been implemented during these years in several places, yes, but those are specific implementations, and an implementation for PyPI in any form hasn't been in operation yet. So that's the status.

Anything else? Yeah, could you comment on what the differences might be between TUF, the generic framework, and sigstore in general? Because that seems to be the going topic within the securing software repositories working group that you mentioned earlier. So just generically, what kind of differences and overlaps might there be?

Right, yeah, it gets a bit complex. Sigstore and TUF are very related, but I don't think they really overlap. Just to give two examples: this is not possible right now, but with small changes to the TUF key system, you could use sigstore keys within TUF, so you could do signing using sigstore and developers wouldn't need actual long-lived private keys; you'd use sigstore and then use those identities within the TUF system. Another commonality is that sigstore actually uses TUF to deliver some of its sources of trust to its clients. So I don't think they overlap as such; they solve slightly different problems in this space, and of course you can choose which you prioritize, and which seems more achievable in the time frame that you have, but both make sense to me.

All right, I think we're done. Thank you, everyone.