Thanks for coming. Thanks for having us. My name is Lukas Pühringer. I work at NYU as a researcher and software engineer on software supply chain security projects, one of them being The Update Framework, and I'm here with my friend Kairo from VMware.

Hi, my name is Kairo. I work at VMware in the open source program office, and I'm a member of the supply chain security team. We work with upstream open source projects on securing the supply chain.

Okay, we are going to talk about PEP 458, which is a specification to secure PyPI downloads. Let's take a look at the agenda of the talk. Before we dive into PEP 458, we will briefly talk about package distribution and about how to secure it with The Update Framework, which is the basis for PEP 458. Kairo will give us a couple of numbers about Warehouse, which is the software that powers PyPI. Then we'll dive into the details of PEP 458, Kairo will talk about his integration journey implementing the whole thing, and if we still have time, because that's a lot, we will talk about what's next.

Okay, let's start with package distribution. I probably don't have to tell you that Python runs everywhere: on servers, on container clusters, on IoT devices. Really, everywhere; it runs the world. In order for it to run everywhere, you need a distribution platform where you get it from and some client that you use to get it, and usually you use pip install for that. And usually, or hopefully, you get the software that you want to download when you do a pip install. I say hopefully because the package distribution infrastructure is also a very attractive target for attackers, because with a single compromise they can multiply that compromise to all the users, thousands or millions of users, that run pip install. So I guess we all agree that we somehow have to protect these packages, we have to take precautions for the event of a key compromise in the publishing infrastructure, and we have to do that at scale. Because you rarely have just one package
that you try to download and only need to protect. That package depends on other packages, which themselves depend on other packages, so you have this huge graph of dependencies, and we heard about dependencies in the prior talk. So again, if anywhere deep down in that dependency graph you get a compromise, it gets propagated everywhere downstream and affects all the clients that download it. And these things have happened. Well, I don't know if everyone saw the prior talk, so: it hasn't happened to PyPI yet. We're more worried about typosquatting on PyPI, but it can happen to PyPI, so we need to protect against key compromises in PyPI. Maybe I can have a raise of hands: who can think of a successful compromise of publishing infrastructure in the recent past? Okay, and who has heard of the name SolarWinds, or SolarBurst? Okay, so these things really happen. So we need to find a solution, and one solution is to sign everything. Cryptographic signatures really help to guarantee authenticity and integrity of your software artifacts, but they are not enough. We have to think of other things as well. We have to make trust decisions at scale, because if you get a signature, you also have to know which key to trust for that signature. And we have, again, to prepare for the event of a key compromise. What do you do?
How do you report to your users that a key should no longer be trusted? And then there are other problems, like freshness and consistency of the repository. A good, generic solution for all of that is The Update Framework (TUF). It came out of security research on package managers almost 15 years ago, but it's still very relevant. TUF works for every setup that has some notion of a content repository server and some recurring clients that come and fetch software, but also come back to update it. We heard in the prior talk that one of the key security actions you can take is to update your software, so you don't only want to fetch once, but come back and update, update, update. TUF has built-in protection for freshness, consistency, and integrity of software. It is resilient to compromise: it both reduces the impact of a compromise and allows in-band recovery. It does this by specifying a bunch of roles, different roles for different responsibilities in the content repository. There's one for the integrity of the software artifacts, one for the consistency of the overall software repository, one for freshness, and then there is one for the root of trust, which defines the signing keys. Roles are a rather abstract concept in TUF, but it's actually pretty simple: they are all represented by one or more metadata files that are signed with a cryptographic key. We won't go into more detail about TUF right now; later, when we look at PEP 458, we'll see how this works in the scope of PyPI. But before that, Kairo will talk a little bit about Warehouse.

So, yeah, I'll give you an introduction to Warehouse, because PEP 458 means mainly working in Warehouse. But what is Warehouse? Most people are not familiar with the name Warehouse, but as the description says, it is the software that powers PyPI.
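To make the idea of "roles represented by signed metadata files" concrete, here is a heavily simplified sketch. It is not the actual TUF wire format, but it is close in spirit: each role is a JSON-style document with a `signed` payload carrying a type, an expiration date, and role-specific content, wrapped together with a list of signatures.

```python
from datetime import datetime, timedelta, timezone

def make_metadata(role_type, expires_in_days, **content):
    """Build a simplified TUF-style metadata document for one role.

    Real TUF metadata has the same overall shape: a 'signed' payload
    (with a type and an expiration date) wrapped with 'signatures'.
    """
    return {
        "signed": {
            "_type": role_type,
            "expires": (datetime.now(timezone.utc)
                        + timedelta(days=expires_in_days)).isoformat(),
            **content,
        },
        "signatures": [],  # filled in by whoever holds the role's keys
    }

# One document per role, mirroring the responsibilities from the talk:
root = make_metadata("root", 365, keys={}, roles={})   # root of trust
targets = make_metadata("targets", 365, targets={})    # artifact integrity
snapshot = make_metadata("snapshot", 1, meta={})       # repo consistency
timestamp = make_metadata("timestamp", 1, meta={})     # freshness

for doc in (root, targets, snapshot, timestamp):
    print(doc["signed"]["_type"], "expires", doc["signed"]["expires"])
```

The field names here (`signed`, `signatures`, `_type`, `expires`) mirror the real TUF metadata format, but the content of each role is reduced to an empty placeholder.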
It's the solution for the repository, the package index. Any time someone runs pip install against the full repository, it's going to Warehouse, because it goes to PyPI. So here are some numbers to share with you about how important Warehouse is: there are more than 280,000 projects hosted there, more than 600,000 users, and daily it serves more than 900 terabytes and two billion requests. So making Warehouse safe has a big impact. With that, I will give back to my friend here.

Okay, thanks. Let's look at PEP 458. As I already said in the beginning of the talk, PEP 458 is a specification, an enhancement proposal, for setting up a minimal TUF design on PyPI. The security properties it gets from TUF are, first, that it makes storage and transport security non-critical: you don't have to trust the artifacts themselves anymore, because you have additional metadata with which you can verify them. You get protection against rollback and freeze attacks; those are attacks where an attacker pretends that there is no new software (a freeze attack) or serves old software to you with a valid signature, software which might contain vulnerabilities. Then it has this great feature of in-band key revocation, and in the case of a compromise it limits the usefulness of the compromise through expiration dates; that's what we call implicit revocation. The PEP does not change any of the user flows, neither for the software uploaders, the developers, nor for the clients; it doesn't change the way you use pip. And, something we will talk about a little at the end of the talk, it's a very important building block for more complicated but more secure designs for PyPI. So let's take a look at how PEP 458 uses TUF.
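The rollback and freeze protection described above boils down to two simple client-side checks, sketched here under the assumption that the client remembers the version number of the last metadata it accepted:

```python
from datetime import datetime, timezone

def check_metadata(signed, previously_seen_version):
    """Simplified client-side checks against rollback and freeze attacks.

    Rollback: reject metadata older than what we have already seen.
    Freeze:   reject metadata whose expiration date has passed, so an
              attacker cannot forever replay a stale (but validly
              signed) view of the repository.
    """
    if signed["version"] < previously_seen_version:
        raise ValueError("rollback attack: version decreased")
    expires = datetime.fromisoformat(signed["expires"])
    if expires <= datetime.now(timezone.utc):
        raise ValueError("freeze attack: metadata has expired")
    return True

# A fresh, newer timestamp passes; stale or expired metadata would not.
fresh = {"version": 42, "expires": "2999-01-01T00:00:00+00:00"}
print(check_metadata(fresh, previously_seen_version=41))  # True
```

The short expiration intervals on the online roles, discussed below, are exactly what makes the freeze check effective.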
So first of all, it defines these different roles, represented by metadata, for the repository. It has a root role, which is the role that provides the root of trust: it defines all the allowed signing keys and signature thresholds for every other role, including itself. Then it has the other roles, which are responsible for freshness, consistency, and integrity of the artifacts. So that you know the terminology: the targets role is the one responsible for integrity, snapshot is the one responsible for consistency, and timestamp for freshness. Except for targets, those are maybe a little bit self-explanatory. The large advantage of having separate roles is that you can balance responsibility against risk and availability of certain roles. For instance, if you have a role that has high responsibility but doesn't need to be available that much, you can minimize the risk of key compromise by keeping that role offline. So for the root role we only use offline signing keys. Whereas the roles that need to be highly available, because they sign on every upload of artifacts, are at higher risk of being compromised, because they are online, so we try to minimize their responsibility.
We don't, for instance, let those roles revoke keys. Another benefit of having separate roles is that you can balance risk, availability, and expiration. For the root role, which is not at high risk of being compromised and does not need to be available or changed that much, we can set long expiration intervals. Whereas for the roles that need to sign every day, many times every second actually, we can have very short expiration times, so that when the online key gets compromised, an attacker can't use that compromised key for long. One more feature of having separate roles is that we can have different signature thresholds. For the online keys a threshold doesn't make much sense, because if you can compromise one online key, you might as well compromise another one. But for offline keys we can: the PEP recommends having multiple key holders appointed by the Python Software Foundation, and an attacker needs to compromise a threshold of those keys to compromise the role. And last but not least, with those different roles we can do key revocation: when we need to change the key for these online roles, we change it in the root role, the root metadata gets shipped out to the client, and the client always knows which keys to trust. One thing to mention: you might ask yourself why the roles at the bottom of the diagram are in one rectangle and have only one key. Why do you need separate roles for this setup?
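The threshold idea for the offline root keys can be sketched like this. Note that real TUF uses asymmetric signatures (for example Ed25519); HMAC is used here only as a stand-in so the sketch runs with the standard library alone, and the key holder names are made up.

```python
import hashlib
import hmac

def verify_threshold(payload, signatures, trusted_keys, threshold):
    """Count how many distinct trusted keys produced a valid signature
    over the payload, and require at least `threshold` of them."""
    valid_keyids = set()
    for keyid, sig in signatures:
        key = trusted_keys.get(keyid)
        if key is None:
            continue  # signature from an unknown key: ignored
        expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
        if hmac.compare_digest(expected, sig):
            valid_keyids.add(keyid)  # each key counts at most once
    return len(valid_keyids) >= threshold

# Three hypothetical PSF-appointed offline key holders, threshold of two;
# only two of them sign this root metadata update:
keys = {"alice": b"k1", "bob": b"k2", "carol": b"k3"}
payload = b'{"_type": "root", "version": 2}'
sigs = [(kid, hmac.new(k, payload, hashlib.sha256).hexdigest())
        for kid, k in keys.items() if kid != "carol"]
print(verify_threshold(payload, sigs, keys, threshold=2))  # True
```

An attacker who compromises a single offline key still cannot meet the threshold, which is exactly the point of appointing multiple key holders.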
For this setup we actually wouldn't need these three separate roles, but for other setups, like the more advanced protection that we'll talk about later, this separation makes sense. Okay, before we talk about how Kairo is implementing this in PyPI, a few data points about the TUF tooling that is available. We have a Python TUF reference implementation, which has the goal of helping people understand the TUF specification, but it's also usable, and actually used in production by large companies. It has two parts. One is the client downloader, which is basically an off-the-shelf solution for most package managers, because the specification has very clearly defined workflows for the client, and they are the same for many different repository setups. The repository side, on the other hand, varies a lot between different implementations. That's also why the Python TUF tooling has very powerful tools for a repository, but you still need to know TUF very well to use them. And Kairo knows about that; he has struggled with it while implementing PEP 458.

So, yeah, I will share a bit of my journey implementing this PEP. I joined the open source program office at VMware a few months ago, and my goal in that team was implementing PEP 458. I started by reading the very good specification that TUF has, and then started contributing to the python-tuf project, the one that Lukas mentioned, which provides the Metadata API and the download client. The TUF team, which Lukas is part of, is about to release version 1.0 of python-tuf, and I was following it very closely and making some contributions.
When I saw that python-tuf was in very good shape, I started the implementation of PEP 458. When I started, my difficulty was that the PEP is huge. The description in the PEP is huge, but there's not a lot of information about the design of the implementation, about how to add TUF to Warehouse. And as I shared, Warehouse is huge; it's big software, and security there is very important. So I struggled a bit to implement it, but as you can see, we have a pull request under review there. I also need to mention that exercising PEP 458 was very interesting for the TUF team, because it helped them construct a very good Metadata API that gives you a lot of flexibility if you want to implement TUF in your own repository software or infrastructure. So I really went to python-tuf, got help to implement it, and used this Metadata API. And yeah, we have something open for review by the PyPI maintainers.

Here is just a small demonstration of pip working against Warehouse with TUF enabled; you can see here in verbose mode that it's using the TUF metadata. This pip version is a proof of concept that I got from a colleague, and I just made some changes, but we can, let's say, prove that it works. I need to say that this pull request is not the entire implementation; it just enables PEP 458 in Warehouse in development mode, so you can run it on your own machine. But it's a start. So what's the status now? I left the PyPI folks a huge pull request that is hanging there, because it's really, really hard to review; it's a lot of code. But now, after some discussions, we are improving it. Lukas helped me to reduce this code and make it easier to review, and that will be one of our next steps. So what do we want to do next? Here I'm sharing the status with you: where we are now and what we want to do.
Right now we have TUF and PEP 458 working in Warehouse in development mode, but, at least for me, this is just a start. Next we want to connect TUF to the Warehouse flows. That means that when someone uploads a new package, or a new version of a package, it will populate all the metadata. We also want to implement the pip side, the proof of concept, and have it live, of course. And in the end, when everything is working, the hard part is the rollout into production. But I should say that as Python users we will not be affected in the way we use PyPI, nor will developers pushing packages; all of this happens in the background. With that, I'll hand back to Lukas.

Nice, yeah. Even further down the road we have some plans for PyPI. I mentioned earlier that PEP 458 is a really important building block, because by itself it already gives you a lot of security guarantees, but it is still somewhat susceptible to compromises of the online publishing infrastructure. It mitigates those compromises in various ways and allows recovery from them, but still, if the online keys are compromised, an attacker can serve arbitrary code. So there is this addition to PEP 458, which is called PEP 480. It uses the same basic metadata layout: it has the root metadata for the signing keys, and it has timestamp, snapshot, and top-level targets metadata for the other properties. But then it spans a whole new trust delegation tree below the targets metadata, which creates trust namespaces for developers.
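A minimal sketch of such a trust namespace: the top-level targets role delegates project names to developer keys, and a client only accepts a project's artifacts if they are signed by a key delegated for that project. The project names and key IDs below are made up for illustration.

```python
def delegated_keys(delegations, project):
    """Return the keyids trusted to sign a given project, per a
    PEP 480-style delegation from the top-level targets role."""
    for d in delegations:
        if project in d["paths"]:
            return set(d["keyids"])
    return set()  # no delegation for this project: trust nothing

# Hypothetical delegations, in the spirit of the talk's examples:
delegations = [
    {"paths": ["django"], "keyids": {"dev-key-1", "dev-key-2"}},
    {"paths": ["numpy"], "keyids": {"dev-key-3"}},
]

def is_trusted(project, signing_keyid):
    return signing_keyid in delegated_keys(delegations, project)

print(is_trusted("django", "dev-key-1"))  # True
print(is_trusted("numpy", "dev-key-1"))   # False: wrong namespace
```

Because the developer keys are held offline by the developers themselves, a compromise of PyPI's online keys cannot produce a signature that falls inside any project's namespace.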
So it says which keys can be trusted to sign specific projects uploaded to PyPI: certain keys for Django, certain keys for NumPy, and so on. The signatures are then actually provided with offline keys by the developers, so even if the online publishing infrastructure gets compromised, an attacker cannot serve arbitrary software. This really protects against online key compromise. It still does not change any workflows for the client downloader; users still just do pip install and are not bothered with any of this at all. But it does change developer workflows. Developers need to handle private keys, we somehow need to establish this initial trust between the developer and PyPI, and developers need to somehow create TUF metadata that will then be pushed to PyPI in order to serve it to the users. So there are a lot of challenges with this, and PEP 458 already is complicated. We will first focus on finalizing PEP 458 and then continue with PEP 480. If you want to follow the developments, there is a new discussion thread on Python Discourse. These slides will be available online, so you can check the URL later. If you're interested in TUF in general: it is not only concerned with PyPI, we work with many other software updaters and package managers. Check out the website, visit us on the Slack channel in the CNCF workspace (TUF is a CNCF project, for those of you who know the CNCF), or drop an email on our mailing list. And with that, I'd like to thank you for your attention and open the floor for questions.

Thank you. Thank you for the talk. I just wanted to understand how The Update Framework works with things like build reproducibility. In the current model, do I understand right that if you compromise the way of getting signatures or keys, you can still sign malware or something? And it seems like in this framework you specify the target files, so what do you do about those packages whose target files are quite dynamic,
for example machine learning models or something like that, which have very expressive setup.py files that take decisions based on the CPU features supported on your target computer, et cetera? How do those two work together?

Okay, I'm not sure I understood everything. The first part was about reproducible builds, is that correct? Yeah. So, TUF is not really concerned with reproducible builds. It provides metadata where you put a signature for a software artifact. But at the same research lab where we develop TUF, we developed a different project called in-toto, which is basically TUF's sister project. It deals with the steps prior to the publishing step of the software supply chain, and it also lets you attest to the reproducibility of a software build. It works very well together with TUF; the software company Datadog actually uses TUF and in-toto to protect their entire supply chain. And the second part, I didn't fully understand what you were asking. It's about very dynamic packages, such as TensorFlow or Hugging Face, which require downloading binary models based on some very dynamic decisions: CPU features, internationalization, localization, et cetera. Okay, and how does TUF handle those, given that you cannot specify the target files? Do you want to answer that question? Yeah, maybe I haven't fully understood, but what I can say is that in that case it would just be a target signed in the repository. TUF and Warehouse, for example, don't deal with the name of the package or that kind of thing; it's just about having metadata that can reference the target files. So if you have multiple target files that are selected by CPU or any other kind of configuration, it's just about the files being indexed; what they mean doesn't matter too much. Okay, it makes sense. Thanks very much.
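The point made in this answer, that TUF metadata simply references target files by length and hash regardless of what the files are, can be illustrated like this (the file name is invented for the example):

```python
import hashlib

def make_target_entry(name, content):
    """Describe one target file the way TUF targets metadata does:
    by length and cryptographic hash, not by what the file means."""
    return {name: {"length": len(content),
                   "hashes": {"sha256": hashlib.sha256(content).hexdigest()}}}

def verify_target(entry, name, content):
    """Client-side check: the downloaded bytes must match the recorded
    length and hash exactly, whether the target is a wheel, an sdist,
    or a binary model picked per CPU feature."""
    info = entry[name]
    return (len(content) == info["length"]
            and hashlib.sha256(content).hexdigest() == info["hashes"]["sha256"])

entry = make_target_entry("example-1.0-py3-none-any.whl", b"wheel bytes")
print(verify_target(entry, "example-1.0-py3-none-any.whl", b"wheel bytes"))  # True
print(verify_target(entry, "example-1.0-py3-none-any.whl", b"tampered!!!"))  # False
```

Dynamic selection logic can pick whichever file it wants; as long as every candidate file has an entry, each download is verified the same way.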
Thank you for the amazing talk, it was very interesting. Maybe you mentioned it, but I'm curious: once the implementation is ready and out, how are you going to handle the transition for those packages that are not maintained anymore? What will the backwards compatibility look like?

Do you want to answer? You can go, actually. Let's say, if you look at the development version now, it doesn't care too much about old package versions. We just take what we have in the Warehouse database now and create the metadata for all the data that is there. That is our decision for now, but we don't know if, when we go for the rollout, we will do a kind of cleanup, or decide which packages go into the metadata first. We don't have a clear transition plan yet; we are still trying to attach TUF to Warehouse and to its flows, and then the rollout will be something to be planned with the PSF folks. Okay, thank you.

Yeah, maybe one thing to add. For PEP 458, the PEP specifies this thing called back-signing, where we just sign all the packages that are already there. For PEP 480 it's a bit more complicated. The targets delegation tree I showed is actually not fully accurate; it's more complicated, in order to support this transition from unsigned to signed packages. I left that out so as not to confuse you more, and because it's not security relevant, but if you look up PEP 480, you'll find all the details of the delegations that also allow you to transition from the status quo to this model. And for that one, there will be more impact on projects and users, because we are asking the developers to sign.

Next question. Great talk, thanks. So, actually, PEP 480 looks really nice. As a company, can you basically whitelist a series of keys, saying:
we allow, you know, Beautiful Soup and Django, we want to whitelist those keys plus our own private keys, so that we don't really care where the packages come from; as long as they're signed with these keys, we'll only accept those packages?

Okay, can you repeat your question and speak closer to the microphone? I'm sorry. So, just wondering, with PEP 480 it looks like it opens up the possibility, inside a company, of basically having a whitelist of signing keys that they trust. So they're saying: okay, we trust the Django developers, we've checked their key; we trust NumPy; we have our own corporate key; and then we just want to reject everything else that isn't signed by those when we install packages. Yeah, absolutely.

Enjoy lunch everyone. Thanks. Thank you.