Hey Christy. Hey Priya, are you eating salsa? Yeah, you know I have to film this talk, but I'm kind of hungry, so I thought I'd have a little snack first. Well, that's a good idea. What kind of salsa are you eating? It's Kubernetes-flavored salsa. Wait, Kubernetes-flavored salsa? What is that? You know, that's actually a really good question, and it's what we're here to talk about today. Let's get into it. Actually, maybe before we get into it, we should introduce ourselves. Yeah, that's a good idea. So my name is Priya. I'm a software engineer at Chainguard. I work on a bunch of open source projects, I'm on the TSC for the Sigstore project, so I do a lot of open source security work, and I work on Tekton Chains with Christy. Hey everybody, I'm Christy. I'm a software engineer at Google. I've been working on Tekton since it started, and lately I've been specializing in the supply chain security area, working on supply chain security for Google Cloud Build, and also working on Chains with Priya. Fun fact: Christy and I were on the same team at Google when she started, and I would just watch her create all this awesome stuff. And hopefully we'll be working together again one day. Not soon, that's absolutely not what I want. That makes it sound like I'm going to quit my job or something. That's not happening, everything's good. Okay, cool. So today we're going to be talking about supply chain security, focusing on CI/CD platforms that run on Kubernetes. Supply chain security has been a really hot topic in the past couple of years, and it's important to know what supply chain security threats are out there and how they could affect your specific build platform. So if you haven't heard of SLSA yet, we'll be talking about it in more detail in a few slides, but it's basically a security framework that you can use to evaluate the security of your supply chain.
SLSA gets you most of the way there in evaluating your specific supply chain, but if you're also using Kubernetes to build software, it's important to consider the unique attributes of Kubernetes when evaluating how secure your supply chain really is. So we'll be talking about how you can use SLSA to evaluate CI/CD on Kubernetes, using Tekton as a case study. We'll dive into a threat model analysis of Tekton on Kubernetes to fill in all of the Kubernetes specifics around how to provide a SLSA-compliant build system. By combining the power of SLSA with a Kubernetes-specific threat model, you can create a secure supply chain, or at least one as secure as we know how to make. Awesome. So first let's take a look at the threat model that SLSA addresses. For those of you who have been interested in the software supply chain space, you have probably seen this diagram before, but for those new to this topic, I'm going to run through it really quickly. As I mentioned before, SLSA is a security framework that you can use to evaluate the security of your software supply chain. It stands for Supply-chain Levels for Software Artifacts. We're looking at the SLSA threat model diagram provided by the SLSA project, and it's a simplified but overall pretty accurate look at a typical supply chain. We start on the left-hand side with the code that a developer is writing. That developer is likely going to pull in dependencies, which is code that they haven't written. Once they've got all their code, the software will be built somewhere, typically some sort of build pipeline like Tekton, or something like GitHub Actions. Once the software is built, it's usually packaged in some way before it's distributed to the final consumer. So you can see that there are many points in this process that are subject to risk, and some of these potential risks are actually called out in the diagram. On the left-hand side we're focused on source integrity.
This is looking at how we make sure the dependencies we pull in are trustworthy, and how we make sure the code that's being added to our repository is also trustworthy. On the right-hand side we're looking at build integrity, which focuses on ensuring the build system is secure. You can see on the diagram that there's a variety of risks that we have to mitigate at each step, and since each risk is so different and happens in such a different place, we need a variety of solutions to address each of them. So SLSA tries to address this variety of risks with a comprehensive list of attributes that your supply chain should follow. You can think of SLSA as a checklist: the more checks you have ticked off, the more secure your supply chain is. For each risk there's an associated explanation as to how the specific SLSA recommendation mitigates it. For example, say that you're worried that modified source code will enter your repository and thus your final software application. One of the things SLSA requires to help you identify whether this threat has occurred is provenance, which, if it's SLSA compliant, will help you identify all the sources used. There are some themes across many of the mitigations that SLSA recommends, but they boil down to two large ideas to keep in mind. The first is that it's important to use a secure build system, and the second is that it's important to verify provenance when it's available. So when you're actually building your software, it's important that your build system is secure, and when you're using software, it's important that you verify the provenance of that software. If you've been following the SLSA project at all, you'll know that it's gearing up for a 1.0 release.
The build track for SLSA 1.0 actually addresses how to create a secure build system, and there are three levels from L1 to L3, with L1 being the most basic level of security and L3 being the highest. At L1 it's important that provenance exists, at L2 it needs to be authentic, and at L3 we need to be able to prove that it is non-forgeable. So now let's bring in our case study, which is Tekton, and take a look at how Tekton meets the SLSA build system requirements. With Tekton we can actually cover all three of the provenance generation requirements that SLSA calls for. Tekton provenance is generated by the optional Tekton Chains service, which is basically a service that runs alongside the core Tekton service in your Kubernetes cluster. Tekton Chains subscribes to updates from the Kubernetes API server for executing workloads, and once those workloads have successfully completed, it's able to generate provenance based on the information in that workload. Not only does it generate provenance, it also signs that provenance. So in theory you would think that the provenance is both authentic and non-forgeable, since it exists, it's been signed, and the information has been taken straight from the API server. Surely the API server won't lie to us. But we might actually need to look into this a little more before we can be confident in that. The second piece of SLSA that we want to look at is isolation strength. Ideally, in build systems that adhere to SLSA requirements, the build processes are isolated and also ephemeral. The classic unit of execution for Tekton is the pod that runs on Kubernetes. A new pod is spun up for every Tekton workload, so you would think that since we have a new pod every time, we get ephemeral and isolated execution for free. This might not actually be the case, and we're going to look into it a little more. So Priya raises some good points.
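To make the provenance requirement concrete, here's a minimal sketch of the kind of in-toto statement a provenance generator like Tekton Chains emits. The field names follow the SLSA v0.2 provenance predicate, but the subject name, digest, and builder ID below are made-up examples, and real Chains output carries far more detail (materials, invocation parameters, and so on).

```python
import json

# Minimal sketch of the in-toto statement that a provenance generator like
# Tekton Chains emits. Field names follow the SLSA v0.2 provenance
# predicate; the subject name, digest, and builder id are made-up examples.
def make_provenance(subject_name: str, subject_digest: str, builder_id: str) -> dict:
    return {
        "_type": "https://in-toto.io/Statement/v0.1",
        "predicateType": "https://slsa.dev/provenance/v0.2",
        "subject": [{"name": subject_name, "digest": {"sha256": subject_digest}}],
        "predicate": {
            "builder": {"id": builder_id},
            # For Tekton, the build type identifies the workload that ran.
            "buildType": "tekton.dev/v1beta1/TaskRun",
        },
    }

stmt = make_provenance("registry.example/app", "ab" * 32, "https://tekton.dev/chains/v2")
print(json.dumps(stmt, indent=2))
```

At L2 and L3 the interesting part is not this payload itself but the signature over it and how hard that signature is to forge, which is what the rest of the talk digs into.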
Out of the box, does Kubernetes actually give us isolated, ephemeral execution with non-forgeable values that we can generate provenance from? Or should we investigate a bit more about how exactly Tekton builds, to be sure? I think we should investigate a bit more. Let's start by zooming in on the SLSA threat model. You may remember this threat in the SLSA model: compromising the build process. It looks so simple, it's just this one little box, but in reality it's a little more complicated than that. This is what build execution looks like in Tekton, and this is even a simplified version. But don't worry, we're going to step through it and look at the specifics of what's happening at every step to put this whole thing into perspective. So first, if you're executing a build workload with Tekton, you execute either a TaskRun or a PipelineRun. A PipelineRun also ultimately just executes a bunch of TaskRuns, so it all comes down to TaskRuns. If you execute a TaskRun, you'll apply it to the cluster, which will store it via the Kubernetes API server in etcd. The Pipelines controller, which actually executes everything for Tekton Pipelines, subscribes to updates from the API server. So as soon as you create that TaskRun, the API server will tell the Pipelines controller, oh, there's something to execute. Next, assuming that you're using build as code, the task or pipeline that you want to execute will be fetched from version control or from an OCI registry. As a side note, build as code is a practice where, like infrastructure as code or any of the other "as codes", instead of just writing a script and firing it off, you're actually storing it somewhere, probably in version control, with a history and hopefully a review process and all the good stuff that you apply to code in general, but applied to your build configuration.
All right, so now that the Pipelines controller has the task or pipeline spec, it knows exactly what it needs to do to run it, so it'll go through the task spec and translate it into a pod that needs to execute. The Pipelines controller will create the pod, which will trigger the API server to start the pod via a kubelet, which will run the pod, which ultimately runs the TaskRun. Once the TaskRun starts, it will start pulling the images it needs to run, because what it executes is a sequence of containers, so each of those images will be fetched from a remote container repository. It also usually needs to mount volumes in order to actually do anything, so those can either be volumes from the cluster that it mounts, or even local disks. As the TaskRun executes, it will emit strings called results, which we'll talk about a little more shortly, and which indicate what the TaskRun has actually done, for example the digest of an image that it has built. The Pipelines controller will read those results. They're emitted by the TaskRun's underlying pod putting them into the termination message of the pod so that the controller can read them, or there's a new experimental feature which has them written to logs, which the controller will then read. Finally, once the TaskRun or PipelineRun completes, the Tekton Chains controller, which is also subscribed to updates on those types from the API server, will be notified that execution completed, and it will start looking at the details of what happened and pulling out individual values that it can use to generate and sign provenance. So here's the threat model at a high level.
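The results path just described can be sketched roughly like this. It's a simplified stand-in for Tekton's real termination-message handling (the actual entry schema and type values differ), just to show the idea of the controller recovering key/value results from a pod's termination message:

```python
import json

# Simplified stand-in for how results travel from a step to the controller:
# the task's pod serializes results into its termination message as JSON,
# and the controller parses them back out. The real Tekton entry schema and
# type values differ; this only illustrates the mechanism.
def parse_results(termination_message: str) -> dict:
    entries = json.loads(termination_message)
    return {e["key"]: e["value"]
            for e in entries if e.get("type") == "task-run-result"}

msg = json.dumps([
    {"key": "IMAGE_DIGEST", "value": "sha256:" + "ab" * 32, "type": "task-run-result"},
    {"key": "StartedAt", "value": "2023-01-01T00:00:00Z", "type": "internal"},
])
results = parse_results(msg)
print(results)  # only the task-run-result entries survive
```

Note that nothing in this path proves who wrote the termination message, which is exactly the forgeability concern discussed next.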
Basically, anywhere data is flowing through the system, there's a threat. And when we look at how Tekton executes on Kubernetes, even a very simplified version of just a simple TaskRun, not even getting into the details of whether it's pulling source or building or pulling dependencies, you can see that there's a lot of data moving around. You'll also notice that we're leaving some of the Kubernetes internal details, like communication between the API server and the kubelet, out of the picture, but we will touch on that briefly later. So let's look at each of these threats in detail. Cool. So the first threat that we're going to look at specifically is around CRDs. These are custom resource definitions in Kubernetes, and Tekton relies on custom resource definitions to create the concepts of a Tekton task and a Tekton pipeline, which are how we actually execute workloads with Tekton and track what is actually happening. Tekton Chains basically watches these workloads, and once the workloads have successfully executed, it grabs all the information from the workload to determine what values to include in build provenance. Ideally, these CRDs should only be updated by the Pipelines controller: only the Pipelines controller should be able to set the status, set the results, and say exactly what steps happened in what order. So the risk that we're trying to prevent is mutated CRDs. In theory, anyone with access to a Kubernetes cluster can go in and edit whatever resources are in that cluster, and since the basis of Tekton is the CRDs, they are at risk of being mutated by anyone who can access them. Another threat is around results; this is kind of a sub-case of the one that Priya was just talking about. As tasks execute in Tekton, they can emit these string values to show what they actually did.
An example of that is if a task is fetching from git, it can emit a result to indicate the actual commit it fetched. For example, if it's resolving a branch name, it's important to know at build time what commit that branch was actually pointing at. These values are read by the Pipelines controller, and they ultimately end up going into the CRD status and then making it into the provenance that Chains generates. So it's very important that we know these were actually generated by the TaskRun we expect them to come from, and that's an important part of meeting the SLSA non-forgeability requirements, because we have to be able to say that the provenance, and the values in it, couldn't have been forged by something else. Since these are ultimately stored in the CRD, they're subject to the same vulnerabilities that Priya was mentioning around CRDs in general, and then there's this additional path where the mechanism that transmits the results to the controller, for example the termination message or the logs, is vulnerable as well. Another small piece of that big diagram that Christy just walked us through is Tekton workspaces, and the risk here is that these workspaces can be mutated. Tekton workspaces are usually backed by PVCs, and these PVCs can in theory be shared across various tasks and pipelines, but the intention is that usually only a specific task in an executing pipeline will write data to the volume. In reality, any pod in the cluster can mount that same volume and mutate the contents, and this makes it really hard to meet the SLSA requirements for isolated execution, because anybody with access to the cluster could in theory start a pod, mount that volume, and then mutate the contents.
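To see why this is hard to prevent, here's a hypothetical pod manifest, sketched as a Python dict, that mounts a pipeline's workspace PVC by name and overwrites its contents. The claim name, image, and paths are invented for illustration; the point is that nothing in Kubernetes itself stops such a pod from being scheduled in the same namespace:

```python
# Hypothetical pod manifest, as a Python dict, that hijacks a pipeline's
# workspace. The PVC name, image, and paths are invented for illustration;
# the point is that any pod in the namespace can claim the same PVC by name.
def rogue_pod(pvc_name: str) -> dict:
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "not-a-build-step"},
        "spec": {
            "containers": [{
                "name": "writer",
                "image": "busybox",
                # Overwrites whatever the pipeline staged in the workspace.
                "command": ["sh", "-c", "echo tampered > /workspace/source.tar"],
                "volumeMounts": [{"name": "ws", "mountPath": "/workspace"}],
            }],
            "volumes": [{
                "name": "ws",
                "persistentVolumeClaim": {"claimName": pvc_name},
            }],
        },
    }

pod = rogue_pod("pipeline-workspace-pvc")
```

RBAC can narrow who may create pods, but anyone who can still create them in that namespace can do this.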
I mentioned earlier that Tekton Pipelines supports build as code for tasks and pipelines, so they can be stored in version control, and there's also a feature called Tekton Bundles, which lets you store tasks and pipelines in an OCI registry. Fetching this data is another place where Tekton is vulnerable. It could be through a less likely, but still possible, man-in-the-middle attack between the repository and Tekton. Or, again, since the fetched content is stored in a CRD, anybody who has permission to modify CRDs within the cluster could modify the content and change the definitions of the tasks and pipelines. That's because the Tekton Pipelines controller, after it fetches these tasks and pipelines, stores them in a CRD. Another potential threat that we should keep in mind is compromised step images. The basic unit of Kubernetes in general is the pod, and when you kick off a Tekton TaskRun you also kick off a pod, and that pod is responsible for pulling in container images to run as steps. So we want to protect against compromised step images and make sure that the images we're running are the ones we intended to run. The last vulnerability that we want to highlight is related to the actual node that the task executes on. One of the SLSA requirements is isolation: it's important that workloads executing in the build cluster can't interfere with each other, so one build workload shouldn't be able to do anything, at least within the context of the build system, that influences another. But if the pods running on the nodes have access to the file system of the underlying node, you can't guarantee that, because you don't know what state one pod has left the node in, and you don't know if that will change the results of another task that runs on that same node. There are a couple of ways this could be violated. One is that you might do it on purpose.
You might explicitly give a task privileged execution permission, and that might sound like something you wouldn't want to do, but it turns out that for a lot of Docker build scenarios this is the common way to do it: you have to give the task's pod elevated permissions in order to actually be able to use Docker to build. Or, even if you're not doing that, there's still the possibility that this could happen because there could be an unpatched vulnerability that allows a malicious workload submitted via a task to escape the pod and access the underlying node. So the last vulnerability is about trying to make sure that tasks can't do anything to the underlying node that might impact other tasks. Looking at that threat model, it seems like out of the box we're not actually getting quite as much for free as maybe we were hoping. There are some unique challenges around meeting the isolation and non-forgeability requirements in particular when we're running on Kubernetes. So what can we do? Great question, Christy. Let's get into it. As I mentioned earlier, one of the threats we want to protect against is mutated CRDs. We want to make sure that someone with access to the cluster can't just go in, edit a bunch of YAML, and say that things happened when they didn't. This could be as simple as saying that a pipeline succeeded when it didn't, or it could mean changing the results of a TaskRun to say that an image was built when it wasn't. So how do we protect against this? Ideally, we don't want anyone to be able to tamper with our Tekton CRDs at all. But for now, we can at least settle for making sure that we can catch it when they do tamper with our CRDs. And the solution that we've come up with involves another open source tool called SPIRE. SPIRE is a production-ready implementation of the SPIFFE APIs, and it's used to perform node and workload attestation.
It basically runs in your cluster alongside Tekton, and they work together to provide workload attestation. SPIRE is a pretty complicated topic, so instead of diving deep into how it works, I'm just going to explain how Tekton is able to take advantage of it. When we set up SPIRE on a Kubernetes cluster, we can set it up to issue SVIDs, or certificates, only to the Tekton controller. The controller can then generate signatures for our Tekton objects, whether that's the results being created or the status of a Tekton pipeline, and it can provide those SPIRE certificates to be used for verification later on. Later on, perhaps in Tekton Chains when build provenance is being created, we can actually verify the signatures against the provided certificates. In this way, we can catch it if someone has tried to modify the Tekton objects, because the signature verification will fail. So while we can't really prevent someone from mutating CRDs at this point, we can be aware of when it happens, and we can cut off the build and provenance process once we realize that something has gone wrong. Another threat is around the content of the tasks and pipelines themselves, and also around the images that these tasks reference. We didn't get into detail about this, but one potential problem is that a task can reference an image, and if it uses a tag or something else that's not fixed to refer to that image, you don't necessarily know what you're actually pulling in. So we need to ensure that when we're using remote locations for tasks and pipelines, those locations are actually the source of truth, and that the definitions of the tasks and pipelines in the remote location are what we're actually executing.
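The detection idea can be illustrated with a toy sketch. Real Tekton with SPIRE uses per-workload x509 SVIDs issued to the controller rather than a shared secret, so the HMAC key and status payload below are stand-ins for that machinery; but a simple signature over the serialized status shows how a later verifier catches a mutated CRD:

```python
import hashlib
import hmac
import json

# Toy illustration of tamper detection on a TaskRun status. Real Tekton
# with SPIRE uses per-workload x509 SVIDs issued to the controller, not a
# shared HMAC key; the key and status payload here are stand-ins.
KEY = b"controller-only-secret"

def sign_status(status: dict) -> str:
    payload = json.dumps(status, sort_keys=True).encode()
    return hmac.new(KEY, payload, hashlib.sha256).hexdigest()

def verify_status(status: dict, signature: str) -> bool:
    return hmac.compare_digest(sign_status(status), signature)

status = {"results": [{"name": "IMAGE_DIGEST", "value": "sha256:abc"}]}
sig = sign_status(status)                      # controller signs what it wrote
assert verify_status(status, sig)              # untouched status verifies
status["results"][0]["value"] = "sha256:evil"  # simulated direct CRD edit
assert not verify_status(status, sig)          # Chains would refuse to attest
```

The important property is that only the controller holds the signing material, so nothing else in the cluster can produce a signature that verifies.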
And we need to make sure that when we're executing images, they are the images the task author intended, and not some other version where something malicious has maybe been inserted in the meantime. The solution we have here is a Tekton feature called trusted resources; you might also hear it referred to as trusted tasks or trusted pipelines. This feature lets authors sign tasks and pipelines using a private key, and the Tekton Pipelines controller can be configured with a policy that requires tasks and pipelines to be verified before execution. What this means is that the content of the task or pipeline has to match the content that was included when the signature was created, and the controller can verify that before it executes anything. An interesting use of this: there's a Tekton open source catalog, and theoretically anyone can create a catalog of reusable tasks and pipelines to be used with Tekton. If you're creating a catalog, you can leverage this feature by signing your tasks and pipelines with a key and then publishing the public key. Anyone who wants to use your catalog can configure their controller so that whenever it fetches from the catalog, it uses that key to verify the contents. Then they know they're getting the exact tasks and pipelines that you wrote, which hopefully you're also testing and code reviewing and all that good stuff. So what they're actually executing isn't just some random script that someone wrote; it's these verified tasks and pipelines that are backed by version control and signed. That way, if anything happened between when these specifications were signed and when execution began, it can be caught, and execution will just stop there.
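Here's a rough sketch of that verify-before-execute flow. A bare SHA-256 digest of the canonical task definition stands in for the real cosign-style signature that trusted resources uses, and the task spec itself is invented:

```python
import hashlib
import json

# Sketch of verify-before-execute for trusted resources. A bare SHA-256
# digest of the canonical task definition stands in for the real
# key-based signature; the task spec here is invented.
def canonical_digest(task_spec: dict) -> str:
    canonical = json.dumps(task_spec, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def verify_before_run(fetched_spec: dict, published_digest: str) -> bool:
    return canonical_digest(fetched_spec) == published_digest

task = {"steps": [{"name": "build",
                   "image": "registry.example/builder@sha256:" + "11" * 32}]}
published = canonical_digest(task)           # what the author "signed"
assert verify_before_run(task, published)    # unmodified task would run
task["steps"][0]["image"] = "evil/builder"   # tampered in the CRD or in transit
assert not verify_before_run(task, published)  # controller refuses to run it
```

A real signature adds the missing piece a bare digest can't give you: proof of who produced the definition, not just that it didn't change.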
There's one problem that doesn't quite address, which is the images referenced inside the task. But there's a mitigation: if you specify your images by digest in the task definition, we get more certainty, because if you specify the image by digest and then sign the entire task, we know that the image you meant to reference is the one that's going to get pulled, and the hash of the image we're actually running should match the digest you specified. Another way that you can violate isolation requirements is via Tekton workspaces. Tekton workspaces are almost always mapped to volumes, and those volumes are subject to being mutated by pretty much any pod in the cluster. We have two potential fixes here, neither of which is super compelling. One is a shorter-term workaround, which is to always use ephemeral volumes for pipeline execution. What that means is that when a pipeline starts to execute, you create a volume, use that volume only in the context of that pipeline, and then destroy it at the end. This on its own doesn't prevent other pods from mutating the volume. To prevent that, you'd have to build a layer on top of Tekton to control how the volumes are used.
And I think this is a pattern we do see with people who are using Tekton inside an organization and building some kind of layer on top of it, but it is obviously a lot of work to build and maintain that layer. The idea would be that if your users are only accessing Tekton through this layer, then you've at least reduced the attack surface to people who have privileged access to the cluster. Basically, if you're not letting users run arbitrary pods in the cluster, then you can have more assurance about which pods might be able to take advantage of this vulnerability of writing to a volume that's being used by another pod. So it's not an answer on its own, but at least an approach you could use to improve the situation. There's also a longer-term fix that's a bit more theoretical, called Tekton artifacts. This is an abstraction that we're currently designing and adding to Tekton, and one of the requirements is that artifacts should be immutable. Once an artifact is produced by a task, nothing should ever be able to change its content, and it should also be possible to verify that it wasn't changed. Once we have this feature, if you switch to using Tekton artifacts throughout the pipeline instead of workspaces, that's a long-term fix for the issue. But it's still very early stages, so we don't even know what the syntax looks like yet, let alone being able to use the feature. So watch this space. Cool. So the last part we'll be discussing is tampering with the underlying node. Again, to meet SLSA's isolation requirements, tasks should not be able to impact any node that they run on, and tasks should also not be able to affect the execution of other tasks. The first step in fixing this is to not allow privilege escalation within the cluster.
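As a starting point, these are standard Kubernetes securityContext fields that shut off the privilege-escalation path for a step's pod, sketched here as a Python dict. Whether you can actually apply all of them depends on what your builds need: as mentioned earlier, Docker-in-Docker builds typically require privileged mode, which is exactly the tradeoff being discussed.

```python
# Standard Kubernetes securityContext fields that shut off the
# privilege-escalation path for a step's container. Whether you can apply
# all of them depends on your builds: Docker-in-Docker builds typically
# need privileged mode, which is exactly the tradeoff discussed above.
def hardened_security_context() -> dict:
    return {
        "runAsNonRoot": True,
        "allowPrivilegeEscalation": False,
        "privileged": False,
        "capabilities": {"drop": ["ALL"]},
        "seccompProfile": {"type": "RuntimeDefault"},
    }

ctx = hardened_security_context()
```

An admission policy (for example, Pod Security Admission at the "restricted" level) can enforce settings like these cluster-wide instead of relying on each task author.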
As soon as a pod is allowed access to the underlying node, if that node is used for another pod, then the isolation guarantees are kind of out the window. The solution for this might depend on exactly how you run your cluster and what environment you run it in, but there are a few different options out there that you could explore, like Kata Containers, or Kubernetes cloud-provider-specific features like GKE Sandbox. There's also the option to use one-time-use VMs instead of pods, so that you don't share an underlying node between pods at all. So, by combining these two threat models, we were able to get a pretty comprehensive picture of the threats to Tekton builds on Kubernetes. With SLSA, we were able to identify the threats outside of the build system, and SLSA also set standards for how build systems need to execute in order to be trusted. Then, by diving into the details of the build piece and creating a threat model for our build system on Kubernetes, we were able to identify a number of specific additional threats and how to mitigate them. Some of these threats mapped directly to SLSA standards, for example around isolation, and some of them were more specific to how this exact CI/CD system works. There were also a few things that came up in our threat analysis that aren't yet addressed by Tekton. For example, we have a way of working around compromised step images by specifying them by digest and then verifying the tasks that specified them. But Tekton could do more. For example, Tekton could actually try to fetch the provenance of those images and verify it before using them. Of course, to be able to do that, we have to actually be generating provenance for more images, so we have a bit of a chicken-and-egg problem. But in the future, that's something Tekton could do. Also, you might have noticed that the solutions for dealing with volume isolation are pretty hand-wavy.
They require either building your own system on top of Tekton or waiting for a feature that doesn't exist yet. It's also going to be interesting to figure out how we can meet the SLSA isolation requirements and still have caching, so you don't have to re-fetch large pieces of data every time you use them. And lastly, we didn't get into this too much, but for a complete picture you should consider the integrity of the cluster itself. Everything we've talked about has been around protecting the execution of the build system, but if you really want to be thorough, you have to start thinking about some of the considerations outside of that, like the binaries that are actually being used to run your Kubernetes cluster, such as the kubelet and the API server. And we talked about adding SPIRE to the picture: what about the binaries used to run SPIRE, and what about the SPIRE agent? Is it possible to compromise those? Probably the easiest way to do that would be if you are in charge of how those pieces are actually being built and distributed and how your organization is using them; if you were malicious, maybe you could insert something into the SPIRE agent binary that makes it pretend to be a different workload. So we haven't fixed everything, but we've at least reduced the surface and identified the other places we need to look into. In summary, if you're working on any sort of CI/CD system and want to do a threat analysis of it, SLSA is a great place to start. The SLSA standards really help provide a comprehensive look at everything you should be keeping in mind, and they're a really great starting point for threat modeling. Combining the SLSA standards with the specifics of the build platform that you're using will show you exactly where your security gaps are and can help you get started figuring out how you want to mitigate some of those risks.
And if you're looking for a build system that was created with SLSA in mind, or you're looking to build a CI/CD system on top of an existing platform, consider using Tekton. And that's Kubernetes-flavored SLSA. Ah, that sounds like pretty good salsa, and I think after all that talking, I'm feeling pretty hungry. Can I have some of that salsa? You know, Christy, you live a thousand miles away from me, but we'll find a way. I don't know what the import rules are, but I feel like maybe if you shipped it right now, it might make it to me before the end of the day. I don't know, let's give it a go. All right, sounds good. If you're interested in more on Tekton and SLSA, please join the Tekton Supply Chain Security Working Group. And a big thank you to Chitrang Patel at Google, who created the Tekton threat model. Lastly, I wrote a book on continuous delivery, and on the slide there's a discount code that will be good until the end of April if you want to get it from Manning. And that's it. Thanks for watching our talk. Awesome. Thank you, everyone.