Thank you, everyone, for coming. Last session of the conference; it's been a fun week, and thank you all for making it out. Hi, my name is Billy Lynch, and today we're going to be talking all things provenance: probably a lot of buzzwords you've heard before if you've been around the supply chain security space, provenance, SLSA, attestations. In particular, what I want to talk about is that when we talk about provenance, these things are often coming from CI pipelines, and there's an interesting question to tease out here: how do we actually trust provenance that's being generated from user-controllable environments? That's what I'm going to refer to as user space.

A little bit about myself: again, my name's Billy. I'm a staff software engineer over at Chainguard, where we do all things supply chain security. I'm a maintainer for a few open-source projects, namely Sigstore Gitsign, which signs Git commits with the keyless flow, as well as Tekton Chains, which generates build provenance for Tekton Pipelines, an open-source CI platform on top of Kubernetes.

So the question I want to ask first is: how do we trust software? When we pull something down from the internet, how do we trust it? What do we look for? There are a few things you might intuitively look at. Where is it coming from? Is it coming from a repository that you know and trust? Is it coming from a package repository, or is it coming from a random file download site? That might be a signal that it's less trustworthy. Is it coming from who you expect? Downloading a Kubernetes image from my personal GitHub is probably less reliable than downloading it from Kubernetes itself. Where did it come from? When was it made? You might be running Postgres or Kubernetes; something that was built more recently is most likely going to be more reliable than something built two or three years ago, just because of vulnerabilities that have been discovered, patched, and released since then. That doesn't necessarily mean the old version is the wrong one to use, but it's another signal to look for. And finally, what is in the thing that you downloaded? This is a newer area, so you might hear a lot about SBOMs and provenance, and whether vulnerability scans have been run on a piece of software. That's also useful information to have, though not all artifacts have it today.

So when we talk about all these things: first buzzword, provenance. I'll get to what we really mean by provenance in a second. Provenance in general is a concept that's really starting to gain a lot more traction. There have been a few announcements recently: npm has announced provenance for npm packages; it's not required, but you can make it available as part of the package. A lot of container images are starting to attach provenance to the things they publish. A lot of open-source projects, Sigstore, SLSA, and in-toto, help this along. And this diagram at the bottom is actually taken from the Sigstore landscape; these are all projects that have self-reported that they are signing and producing attestations for the software that they produce.

So what is provenance? The term comes somewhat from the art world.
The analogy here is that, similar to how with a piece of art we're able to trace back that this person had it at this date, and then it was sold, all the way back to the original creator, we want to be able to do the same thing with software. So it's really the who, what, where, when, and why of software artifacts. Another word you'll hear a lot when talking about provenance is attestations, and what that really means is: who is actually making this assertion, who's making this claim? So the quick overview is: provenance, which is metadata, plus some identity, and that's how you get an attestation. Just a quick definition there.

This is an example of what a provenance document looks like. One of the things we do at Chainguard is provide secure-by-default images for a lot of popular open-source projects, so this is just provenance I pulled from one of our images; I forget which one it is, actually. But you can see some useful information here. You have what's called a predicate type: what type of metadata are we including here? In this case it's a SLSA document. You can see identity information for where this came from, so even if you haven't seen this format before, there are some useful pointers, like something here that says this came from GitHub Actions, or at least we assume it's coming from GitHub Actions. We have a repository URL; maybe we should go dig into that and see what it's doing. We can see commit information: what the repo was, what SHA it was built from, what the build config is, what the attempt number is. This is all super useful information that we can use to start making decisions about whether we trust this or not. And a quick shout-out: if you're interested in this, this is something I threw together as a weekend project; you can just pop in any OCI image and it'll spit out whatever it can find around signatures and attestations, if you want to play around.

So that brings us to the problem. All of this metadata is being generated by CI pipelines. How do we actually know that it's accurate? How do we know that it's correct? Because the same CI pipeline that's producing this could also be lying; it could just be producing what it wants you to think is happening. And the scary situation is, if there's ever an account takeover or someone slips something into a repo, you may not be able to trust the metadata that's being produced. So what do we do about it? That's what I want to dig into here.

The first challenge is identity. I mentioned with attestations before: provenance plus identity equals attestation. It's important to know where things are coming from. Information coming from a developer's personal machine may not be as trustworthy as something coming from your production CI pipeline. We want to be able to link this data together, and usually that's done with some form of signature. Typically this could just be a standard public/private key pair, but I will make a note here: when you're using long-lived keys, that also introduces a bunch of challenges around key management, because the assumption that a key equals an identity is only true so long as that identity holds onto that key. If it ever leaks, if anything ever happens to it, you need to have a plan to remediate that.
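To make the provenance document from a moment ago a bit more concrete, here is a rough sketch of its shape. The field names follow the in-toto Statement and SLSA v0.2 provenance predicate, but the values (image name, repo URL, digests, workflow path) are made up for illustration rather than taken from a real attestation.

```python
# Rough sketch of the shape of a provenance attestation like the one on the slide.
# Field names follow the in-toto Statement / SLSA v0.2 predicate; values are made up.
import json

statement = {
    "_type": "https://in-toto.io/Statement/v0.1",
    "subject": [
        {"name": "example.dev/some-image", "digest": {"sha256": "aaaa..."}},
    ],
    "predicateType": "https://slsa.dev/provenance/v0.2",
    "predicate": {
        "builder": {
            "id": "https://github.com/example/repo/.github/workflows/release.yaml@refs/heads/main"
        },
        "invocation": {
            "configSource": {
                "uri": "git+https://github.com/example/repo",
                "digest": {"sha1": "bbbb..."},
                "entryPoint": ".github/workflows/release.yaml",
            }
        },
    },
}

# The fields you reach for when deciding whether to trust an artifact:
predicate = statement["predicate"]
print("built by: ", predicate["builder"]["id"])
print("from repo:", predicate["invocation"]["configSource"]["uri"])
print("at commit:", predicate["invocation"]["configSource"]["digest"]["sha1"])
print("subject:  ", json.dumps(statement["subject"][0]))
```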
So I'll make a quick shout-out here; we won't go too deep into it. The Sigstore project, if you haven't seen it before, is a fantastic open-source tool. One of the main core principles is exchanging short-lived OIDC tokens, authorization tokens that assert an identity, for a short-lived certificate. The idea is that with that process you can actually generate keys on the fly: a brand new ephemeral key every single time, signed into a certificate by that CA and bound to that identity. So no longer are you asking, was it signed by this key? Now you can ask, was this a key that was bound to billy@chainguard.dev, or to my production CI workload? So if you haven't checked it out, please do; it's super useful. And if you're an open-source maintainer, the Sigstore project maintains a public-good instance that's completely free for open-source projects, and anyone really, to use, so you don't have to worry about running your own infrastructure if you don't want to.

But even that's not enough. Even if we have signatures, even if we use Sigstore and bind them to an identity, some identities are better than others. The example here is: you have two tokens, or two certificates. One of them comes from accounts.google.com. Another one comes from justtrustme.dev, which is a real site, by the way; one of my friends, Eddie, made it for another talk. It's a valid OIDC issuer that will issue you a valid token for whatever you give it, so it is completely valid and you can actually mint tokens with it. This is a very important distinction to make, especially for companies. For example, some of the supported issuers for Sigstore are Google and GitHub. For our own corporate accounts, the source of truth for our identity should be Google, not GitHub. So knowing who is making that claim, who is making that assertion, is very important. This also has impacts on how you manage your identities and policies internally. For example, if you're supporting an open-source project and open-source artifacts, you'll probably want to use a well-known identity provider, something like Google or GitHub, just because, pragmatically, larger companies probably have more resources to deal with authentication and identity issues. If you want to run this yourself, you're more than welcome to, but you have to do that with the understanding that other people who don't have the same visibility into that infrastructure may not have the same level of trust in it that you do. So this is going to be a very personal decision for how you approach what is trustworthy and what is not.

Challenge number two: even if we know where it's coming from, how do we know it's correct? How do we know it's not lying? How do we know the data is actually in there? There are two approaches that I've seen for this. One of them I call user space: you generate the provenance in the build itself. So as part of your GitHub Action, you run a vulnerability scan or you run a build, and then you generate a provenance document that says, this is the SBOM of things that were included in the build. That's actually great in most cases, because your build tool is best equipped to know what actually goes into your build. It also means that users themselves can write the documents that are most relevant to them.
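As a toy sketch of what that can look like, a build step might compute the digest of what it just produced and write a provenance file next to it. Everything here (paths, repo URL, buildType) is hypothetical, and note that this document can claim whatever the build wants it to, which is exactly the trade-off discussed next.

```python
# Toy example of "user space" provenance: the build step itself records what it
# produced and what went into it, because the build tool knows that best.
# File names, repo URL, and the buildType URI are all hypothetical.
import hashlib
import json
import pathlib

def sha256(path: str) -> str:
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

provenance = {
    "_type": "https://in-toto.io/Statement/v0.1",
    "subject": [
        {"name": "dist/app.tar.gz", "digest": {"sha256": sha256("dist/app.tar.gz")}}
    ],
    "predicateType": "https://slsa.dev/provenance/v0.2",
    "predicate": {
        "buildType": "https://example.dev/toy-build",
        "materials": [
            {"uri": "git+https://github.com/example/repo", "digest": {"sha1": "cccc..."}}
        ],
    },
}

pathlib.Path("dist/app.provenance.json").write_text(json.dumps(provenance, indent=2))
# In a real pipeline you would then sign this so it becomes an attestation
# (for example with cosign attest); unsigned, it's just a claim in a file.
```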
There are a few predetermined formats. SLSA is a project under the OpenSSF that is trying to standardize things like build and CI job arguments, what goes into build processes, and how they're run. But if you want to use another format, or you want to include some other metadata about the artifact, you should be empowered to do that; not all tools will necessarily know how to handle it, though. So that's a benefit, but it's also a con, because if you can do whatever you want, that also means you can do whatever you want and lie. That's a bit of a challenge.

On the other hand, you have what I call system space. These would be managed control planes, things that you don't necessarily have direct control of, that are making these assertions on your behalf. These would be things like: you build an image, and then maybe an image scanner goes in and builds an SBOM based on the inputs. There are some better properties here in one sense, because the users making the PRs and building these things don't necessarily have the ability to mutate and lie about what gets produced. However, because it's running out of band, it may not have the best visibility into all the information that you want or need. And if you want more precise data, you now have to go to your system provider and say, hey, please add this feature for me, because I want this other piece of metadata that you don't support out of the box.

So the question is, which is better? And the answer is really neither. They both have good properties; they both have trade-offs. But an idea that comes out of this is: what if we can take both of them and get the best of both worlds? That's what we really mean by supply chains, actually building the chain. The idea is that user-space provenance is really good for being very precise, not so much about what goes into the build, but about what goes into the artifact. The system space, on the other hand, can say with a lot more authority what went into that build, what went into the process that kicked off to generate that artifact. If you put these two things together, you can make a correlation, a link between the two, in order to verify them. So on the right here, by default we can just have a user-space thing generate provenance about the artifacts. But then we can layer that with system provenance, build provenance, so that we also have provenance about the build process itself. And then what we can do from there is say: in order to trust the artifact provenance, we check that and make sure it's okay, but then in order to trust that provenance, we cross-reference it with the system-level provenance, so that we have that less user-controlled space to link it against.
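Here is a rough sketch of that cross-check. In the real flow, the builder's identity would come from the signing certificate (for example via Sigstore) rather than from a field inside the document, and the trusted builder ID below is a placeholder; this just spells out the linking logic.

```python
# Sketch of linking "user space" artifact provenance with "system space" build
# provenance: only trust the artifact provenance if a system-generated build
# provenance covers the same artifact and comes from a builder we trust.
# The trusted builder ID is a placeholder; in practice the identity check is
# done against the signing certificate, not a field inside the document.

TRUSTED_BUILDERS = {
    "https://github.com/example/repo/.github/workflows/release.yaml@refs/heads/main",
}

def subject_digests(statement: dict) -> set[str]:
    return {s["digest"]["sha256"] for s in statement.get("subject", [])}

def cross_check(artifact_prov: dict, build_prov: dict, artifact_sha256: str) -> bool:
    # 1. The user-space provenance has to actually be about this artifact.
    if artifact_sha256 not in subject_digests(artifact_prov):
        return False
    # 2. The system-space build provenance has to cover the same artifact...
    if artifact_sha256 not in subject_digests(build_prov):
        return False
    # 3. ...and has to come from a build system we've decided to trust.
    return build_prov["predicate"]["builder"]["id"] in TRUSTED_BUILDERS
```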
So that was all pretty abstract, and I want to go into some more real-life examples. There are actually multiple ways you can do this. One is that the SLSA project has what they call GitHub Actions generators for generating SLSA provenance. These are pre-made GitHub Actions workflows that you can plug in, and what you're doing is giving up control over how your artifact is being built; but in exchange for that control, you're getting more security guarantees about how that thing was built. How this looks is that in your GitHub Actions workflow you reference the SLSA generator and give it, okay, this is the repo to build, this is what language to use, this is what folder to look into, stuff like that. Then that takes over, builds your artifact, and hands you the output. Here, the user space is your workflow that called the SLSA generator. The system space is the generator action itself, because you don't actually have control of it; you just have a reference to it. Going back to what we said before, what you can do when you're verifying your artifacts is check whether the artifact came from that SLSA generator, because that's what gets included in the certificate, at least for Sigstore; you can actually assert that it came from the CI job you expected. If it came from somewhere else, another repo, or a repo you don't control, that's a signal for you: okay, we don't actually know what's in here, we can't necessarily trust it. And going back to the example from before, this is exactly what was happening there, not with the SLSA generator, but you can see here this is Chainguard's own release workflow. The expectation is that for Chainguard images it should be the Chainguard release workflow, and if it's not, that's a problem. And the same process applies for the SLSA generators as well.

That's not the only way to do it, though. The other project that I help maintain is Tekton Chains. As I mentioned before, what Chains does is watch Tekton Pipelines and generate build provenance by observing what happens in the cluster. What happens here is that users just create pipelines and run tasks as usual. Those tasks might produce provenance themselves: inputs, outputs, SLSA provenance, things like that. Then Chains will generate build provenance about the things that it observes. So even though the user isn't doing anything special beyond that, Chains will sign its provenance with its own identity, and the expectation is: you have your artifact provenance, and then Chains gives you provenance about what this build ran. You can then make a similar assertion, this build ran with these images, with these parameters and these arguments, and verify artifacts the same way. So, two different approaches to get the same effect.
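Either way, the verification step boils down to the same kind of identity check. A minimal sketch, assuming example values: cosign exposes this kind of check through flags like --certificate-identity and --certificate-oidc-issuer, and the expected workflow URL here is illustrative rather than a real one.

```python
# Sketch of the verification-side identity check: was this attestation signed
# under the identity we expect (the exact CI workflow), issued by the OIDC
# provider we expect? The expected values below are examples only.

EXPECTED_ISSUER = "https://token.actions.githubusercontent.com"  # GitHub Actions OIDC issuer
EXPECTED_IDENTITY = (
    "https://github.com/example/repo/.github/workflows/release.yaml@refs/heads/main"
)

def is_trusted_signer(cert_identity: str, cert_issuer: str) -> bool:
    # A token from justtrustme.dev is perfectly "valid", just not from an
    # issuer we've chosen to trust.
    if cert_issuer != EXPECTED_ISSUER:
        return False
    # The artifact should come from the exact release workflow we expect,
    # not a fork or some other repo we don't control.
    return cert_identity == EXPECTED_IDENTITY
```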
Now, if you're paying attention here, you might notice: okay, we might trust the artifact now, and then we might trust the system space, but how do we trust the system to do the right thing? This is a bit of an unsolved problem in supply chain security. It's something people are working on very heavily, but you can keep asking the same question: how do we trust GitHub Actions? What if GitHub Actions is compromised? What if the hardware and the VMs that GitHub Actions runs on are compromised? What if the firmware of the chips they're running on is compromised? You can ask these questions forever; practically, it comes back to defining what your trust boundary is and what you're comfortable accepting. Most people are probably going to be okay with trusting GitHub Actions, but that's not necessarily your only option. As an example, Kubernetes uses their own release cluster, which runs on a Kubernetes cluster itself, but it's a public cluster that people can audit. It's a known, documented cluster where those things come from, and so even though they run it themselves, it gives you that same level of trust: if I trust the Kubernetes maintainers and I trust their infrastructure, then I should be able to trust the images that they build as well. So again, it's not a one-size-fits-all answer.

Another important thing I'll note here is that this answer isn't always a technical one. Often, when you're running on cloud providers and things like that, there are legal agreements, terms of service, and so on. The same way that there's nothing physically stopping you from stealing something from a store, the repercussions that come after might be enough of a deterrent, and it also means you have some recourse if things go wrong. Even though that's not a satisfying technical answer, it does give you some protection and some level of trust from a business, pragmatic standpoint.

So again, it all comes down to trust, but verify, and defining your trust boundary. For open-source projects in particular, the things people tend to want and look for are public code and public builds on well-known infrastructure. It doesn't have to be a cloud CI service, but it needs to be something that's in the open, that people can audit, things like that. This might change for private use cases: if you're a company that is shipping binaries to other people to run, maybe it's fine to put that on your private build infrastructure, and maybe that's something your customers are willing to trust, but it needs to come from your build infrastructure with some proof that it's coming from your build infrastructure; otherwise, how do they know they're running the right thing? And for your own internal use cases, maybe everything can be private and the external world doesn't need visibility, so long as your employees, and whoever needs it for your use cases, have the visibility to do that auditing later on.

And then finally, we want to be able to enforce these things. All of this is great, but if we don't actually check it and don't do anything about it, it's kind of useless. This part is a bit focused on Kubernetes workloads: these are three projects, you've probably seen them before, that do signature and attestation verification for container images right on Kubernetes. This is a great example of automating these signature and attestation checks in a way that's more scalable, so that you know you're actually using them, because you don't want a human going one by one checking, are we actually doing this? So if you haven't checked these out, you should. And another open question is how we apply this to other ecosystems, binaries, things like that; it is something under active discussion in a lot of different ecosystems.

So I burned through that way faster than I expected, so thank you, and I'm happy to take any questions.

Yeah, those are great questions; I'll repeat them for the recording. Talking a little bit about SLSA, which I kind of glossed over: SLSA is a project under the OpenSSF, and what it's trying to do is define levels for how much you can trust the build process, how this thing was built.
Level one is basically, okay, here's our build process, we documented it in a Markdown file, all the way up to level three and level four, where everything is automated, you should be able to take it and run it yourself, and we populate all the inputs and outputs and make them available to you. So the higher up the SLSA levels you get, the more data you have, which means the better decisions you can make. Is that the avenue you were asking about? Yes, yes.

So yeah, reproducible builds are definitely an important factor, though not every ecosystem can do them yet. The idea behind a reproducible build is that, ideally, for a given set of inputs you should get the same output. What's nice about that is, if we have those build instructions, we should be able to say: Kubernetes has released this artifact, and here are the build steps they used; I should be able to take that, run it on my infrastructure, and confirm, as a sort of neutral third party, that I get the same output. If I get the same output, then there's more faith and trust that it actually did the right thing and nothing has been tampered with at those lower levels that we don't have as much visibility into, because I can take it and run it on my own hardware. There are some trade-offs there, because it also means we have to double the compute for every single release, but arguably, for important open-source projects, that is something we should be doing: having those two antagonistic systems against each other means that in order to compromise the whole thing, you have to compromise both. So there are a lot of useful properties there. Thank you. Other questions?

Yeah, that's one way you can do it. The important thing, though, if you're trying to do that setup: if you look into the SolarWinds paper, that was how they chose to approach the problems they had with their CI/CD pipeline, basically running two different clusters that were completely separate, and unless they produced the same result, don't trust the output of that build. It does help, but you also need to take into consideration whether running both builds on the same system actually gives you more security guarantees; something you might need to do is run one on GitLab and one somewhere else, but that also creates additional overhead, so it's really a trade-off of how sensitive you are to those types of compromises. What is good, though, is that if you have that provenance data and there is a compromise, you have the metadata to go back and audit: if GitLab was compromised during this timeframe, what builds did I produce during that timeframe that were also affected? You should be able to go back and have that log of, here are the artifacts; maybe I remove them, maybe I put out some sort of notice that these things were affected. So even if it doesn't proactively prevent the compromise, you do get some extra auditability from having this provenance data.

Yeah, that's a great question. I don't know, off the top of my head, of any existing projects that try to represent that right now. Ideally, whenever you build an artifact, there's going to be some hash associated with it, probably a SHA-256 digest. So the simplest way is to take the output from both build systems for that same artifact and check that the hashes match.
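A minimal sketch of that check, with hypothetical paths for where each builder publishes its output:

```python
# Minimal sketch of the reproducibility check: build the same inputs on two
# independent systems and compare the digests of the outputs. Paths are
# hypothetical placeholders for wherever each builder publishes its artifact.
import hashlib
import pathlib

def sha256(path: str) -> str:
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

digest_a = sha256("builder-a/app.tar.gz")
digest_b = sha256("builder-b/app.tar.gz")

if digest_a == digest_b:
    print("reproducible: both builders produced", digest_a)
else:
    print("mismatch: don't trust this release until you understand why")
```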
That is a great question. I think Tom might have an answer here. Do we have a mic, by the way, for the Q&A? I want to make sure this gets captured in the recording. So, to repeat, the question was how we represent this, or what provenance data we can include, and I'll try to repeat the answer: if you have provenance with the same inputs, with the same hashes, from builders that are sufficiently far apart, and the same outputs with the same hashes, and you compare those two pieces of data against each other, that can be a mechanism to check that the build really is reproducible. We've got a mic.

So, I agree with you: a way to identify whether something is reproducible is by having two different provenances, same inputs producing same outputs; that is what a reproducible build is. There is an additional complication when you record higher-level things, for example in SBOMs. One of the more recent changes this year in SPDX 3, and I think they've also incorporated it back into SPDX 2, is that they record a date. But what does the date mean? Because if you build it multiple times, there are multiple dates. So increasingly, for reproducibility, the recommendation is that when you have a date, it refers to a date that's unchanging with reference to the source code, which usually means the date of the last commit, so it's always that date for that particular version of the source code. Therefore the dates getting compared are the same; they're not the build dates, they're the date of last change, but that's actually a perfectly reasonable way to assign a date. And that's the kind of nitty-gritty detail people have to get right when they start doing these kinds of cross-comparisons. It depends on your ecosystem: some reproducible builds are easy, some are absolutely not, but if you care a lot about subversion, you just mentioned Orion, unfortunately we had a serious problem with SolarWinds, and that's a technique to deal with it.

Yeah, so to build on that, a common thing that a lot of build tools are standardizing on is an environment variable called SOURCE_DATE_EPOCH that you can set, and a lot of build tools will respect that as the reproducible build date. I think Docker respects it, ko respects it. I think there's a site somewhere that lists all the build tools that recognize and respond to it.
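As a rough sketch of that idea: `git log -1 --pretty=%ct` prints the last commit's timestamp as a Unix epoch, and the build command at the end is just a placeholder for whatever your build actually is.

```python
# Sketch of pinning build timestamps to the last commit so that independent
# builds of the same source can agree on "the" date.
import os
import subprocess

commit_time = subprocess.run(
    ["git", "log", "-1", "--pretty=%ct"],  # committer time of HEAD as a Unix epoch
    capture_output=True, text=True, check=True,
).stdout.strip()

env = dict(os.environ, SOURCE_DATE_EPOCH=commit_time)
# Tools that honor SOURCE_DATE_EPOCH will embed this instead of "now".
# The build command below is illustrative.
subprocess.run(["make", "release"], env=env, check=True)
```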
Other questions? So you've been talking a lot about build provenance, and I guess my question is: you can have the best, most correct, most reproducible build in the world, but if the source that I feed to that build is malicious, it's not a good artifact to run. Do you have any thoughts on what we might want to do about that?

You're teeing me up, Zach. Yeah, so what Zach is getting at here is that, the same way we're looking at build provenance, source provenance is also important. Signing your Git commits is very, very important in order to make sure that things haven't been tampered with on the way from GitHub to your actual build infrastructure. So Gitsign, which is another one of the projects I maintain, leans into that: you can do the same Sigstore identity-based signing and tie things to your account, rather than to a pre-generated GPG key or SSH key where, when you generated it, you probably just hit enter, enter, enter and didn't put a password on it. Or you're using the GitHub web-flow GPG key that everyone else on the internet uses as well, where everyone's commits are signed with the same key. It's something I'm personally looking into more, and I think in the OpenSSF and SLSA there are also discussions actively going on right now around how we represent source provenance, and how we represent it in a consistent way across source providers. So if you're interested, reach out to the SLSA community, or reach out to me; I'm more than happy to geek out on these things. Cool, all right, thank you so much, everyone.