Thank you everyone for having me today. My name is Dan Lawrence. I'm a software engineer at Google. I've been there about seven years, working in the cloud, platform-as-a-service, container, and Kubernetes space. It's been very exciting. Today I'm here to talk about standardization in software delivery: continuous integration and continuous delivery. It's one of the areas we've been working on in the CD Foundation, where I serve on the technical oversight committee, and I'll be talking about the Tekton project specifically. I'm going to start out with some examples from security to motivate the talk, beginning with supply chain security. We've touched on supply chains a couple of times in other talks today, but here we'll talk about supply chains for software and why standards matter. So let's start with a little quiz. I hope everybody remembers what this clip art piece here at the top is: before the cloud and Dropbox and Google Drive, this is how people used to transfer files between computers. If you found one of these on the sidewalk outside of your work, would you take it inside, plug it into your computer, open up the files, and run any binaries or programs on there? I hope that everyone's security teams have taught them well enough not to do things like this. You have no idea what's on that device. You have no idea what those programs are. Of course you wouldn't want to run them. And if you happen to work in a data center serving production workloads, I really hope that you don't take this inside your data center and run what you find on it. You'd be exposing all of your trusted user data and all of your sensitive production workloads to code that you have no idea what it does. You don't know who wrote it. It could exfiltrate user data. It could delete data. It could mine cryptocurrency.
There's a whole bunch of things that could happen, and it would be a terrible idea to just take one of these things inside and run the programs you find on it. That seems a little bit silly, though, so let's compare it to normal software development practices. Say you're going to write a simple HTTP web application in Node.js. The first thing you might do is install a package; in this case we'll use Express, a common one. We type npm install express, and you can see that it downloads about 50 different packages from 30-something maintainers and installs them next to our code before we've even started writing anything. That's thousands of lines of code that we have no idea where it came from. So this really isn't that much different from taking a USB flash drive into your data center and giving it access to root credentials and trusted user data. It's pretty scary. Without being able to audit all these lines of software (and let's be realistic, nobody has time to look at 5,000 or 50,000 lines of open source code), we have no idea what's going on in there. So basically, open source is under attack. These are called supply chain attacks, where malicious code gets to you through third-party dependencies. And this isn't a contrived example; I'm not trying to scare people for no reason. Real attacks like this are happening every day. Just a week or two ago, ZDNet reported that packages were taken down from PyPI, the Python package index. This was a pretty clever two-part supply chain attack. The first part used a technique called typosquatting. There's a library called jellyfish that's pretty common in Python, and the attacker uploaded another version of that package where one of the l's was switched for a capital I or a one or something like that, so it's very hard to notice unless you're looking for it. Typosquatting attacks are nothing new; they happen all the time.
They basically rely on people copying and pasting something without looking at it, or making a small error while typing a command. Thankfully, if you don't mistype one of these commands, you're not likely to be affected. In this case, the malicious jellyfish package downloaded a payload from GitHub and then executed it dynamically, and it was caught doing things like exfiltrating GPG and SSH credentials. What made this attack pretty clever, though, is that it was two parts. The same attacker uploaded another package. This package itself was pretty innocuous: it just contained some date parsing functions that worked in Python 3. So if you looked at the python3-dateutil code itself, you wouldn't have found anything wrong with it. The tricky part is that it declared a dependency on the jellyfish library, and not the real one: the one with the typo inside it. So if you just happened to scroll through its dependencies without looking character by character, you would have thought you were getting the real jellyfish library, but instead you were subject to arbitrary code execution, and you could lose your SSH and GPG keys. And there's lots more; this is far from the first time something like this has happened. I won't go through all of these examples, but there's a whole bunch of different techniques, and the creativity shows that there's no one size or shape for these attacks. This is something that everybody needs to be concerned about today. bootstrap-sass is one of the biggest ones here. This is a very popular Ruby library. The committers in this case did nothing wrong themselves; one of them had their credentials stolen, the credentials used to upload this package to RubyGems. And the attacker uploaded a new release containing code that was never actually checked into GitHub.
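To make the jellyfish attack concrete, here's a minimal sketch of the kind of typosquat check a dependency audit could do, assuming you maintain a curated allowlist of known-good package names. The names, the confusable-character table, and the function names are all illustrative, not any real tool's API.

```python
# Sketch: flag declared dependencies whose names look like a known-good
# package but aren't spelled identically (the jeIlyfish-style attack).
import unicodedata

# Hypothetical allowlist of packages we've actually vetted.
KNOWN_GOOD = {"jellyfish", "python-dateutil", "requests"}

# Characters commonly swapped in typosquatting attacks.
CONFUSABLES = str.maketrans({"1": "l", "I": "l", "0": "o", "5": "s"})

def normalize(name: str) -> str:
    """Fold visually confusable characters to a canonical lowercase form."""
    return unicodedata.normalize("NFKC", name).translate(CONFUSABLES).lower()

def check_dependency(name: str) -> str:
    """Classify a declared dependency as ok, suspicious, or unknown."""
    if name in KNOWN_GOOD:
        return "ok"
    if normalize(name) in KNOWN_GOOD:
        # Normalizes to a known name but isn't spelled the same:
        # exactly the character-by-character trap described above.
        return "suspicious"
    return "unknown"

print(check_dependency("jellyfish"))   # ok
print(check_dependency("jeIlyfish"))   # suspicious
print(check_dependency("leftpadx"))    # unknown
```

A real audit would also run this over the transitive dependency tree, since, as in the python3-dateutil case, the typosquat can hide one level down.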
So even if you looked at the code for that version on GitHub or a popular repository, and did a review of everything line by line yourself, you still would have missed this. The supply chain for these packages has a whole bunch of different points where vulnerabilities can be injected, and it's very difficult to protect against all of them. The one that I really like, in a kind of sick and sad way, is the docker123321 case. This was basically a long con where someone uploaded a whole bunch of actually useful container images to Docker Hub when it first came out. These images were used for years. After building up this user base over a long time, they injected code to mine cryptocurrency into the next versions of the images. So these people actually started out by building something useful, got it used, and then added cryptomining code. There's really no way to prevent these things without actually looking at everything that you're using and being aware of it. These images had been live on Docker Hub for over a year before anyone even noticed. It's pretty terrifying. So how do we fix this? And what do software delivery and the Continuous Delivery Foundation have to do with it? All these things are attacks on supply chains, and supply chains are basically how we handle software delivery. Software delivery is the process of getting our code from ourselves to our users. As engineers, we like to think code is a great thing; we like to write code. But just like physical goods sitting in a warehouse can get damaged or broken, our code doesn't actually add any business value until we get it delivered into the hands of our customers. That's where software delivery comes in: it's the process of converting code sitting in a Git repo into something useful that our customers can make use of, something that adds business value. Unfortunately, as this Kelsey Hightower tweet points out, we can't really talk about software delivery as one single thing anymore.
There's no single continuous integration and delivery setup that will work for everyone. Every company is different. These pipelines basically encapsulate a company's culture: what processes are required, which checks, which code review techniques we have to use. Everybody has a different software delivery pipeline because they have different business use cases and different customers. That leaves us in a pretty sad state. Since software delivery is required, but nobody sells their delivery pipeline, everyone ends up building a custom one. You end up with these Rube Goldberg machines that everyone stitches together with bash and configuration files. Getting something good enough is the goal in these cases. People don't maintain these, they're hard to make portable, and as soon as you get it working, you tend to call it a day. This delivery pipeline has all the data you need to start auditing your supply chain and figuring out what's coming in and where the vulnerabilities are, but it's buried. It's scattered in build logs inside of Jenkins servers and inside of commit history. There's no real way to get things out of there, and it definitely doesn't scale when we try to take a broader view of software delivery in our supply chain. So what do I mean by that? In our Node.js example, when we installed one package, we ended up with 50 packages. Those dependencies have dependencies. The authors of each of those libraries all have their own sets of Rube Goldberg machines that they use to deliver their code from their developers' keyboards to you as their end user. I've only stepped back one level here, but this really applies transitively to all the dependencies of our dependencies. These Rube Goldberg machines have all the data we need to start securing supply chains, but again, we can't access it because it's buried in different file formats and different specifications. Let's take another example here and talk about a different type of delivery pipeline.
As I said, I've been an engineer at Google for a long time, and Google's pretty famous for doing things a little bit differently than the rest of the industry. Not necessarily better, just different. Google uses a monorepo, which means we have one giant source repository for everything. Our first-party code, our third-party code, the source code for all of our tools, and our production configuration all live together inside this environment; third-party code is checked into the monorepo in a special directory. And since we're one company using the same build system across all these libraries and tools, we don't have the scaling Rube Goldberg machine problem; we just have one. We have one team that maintains most of this stuff, so we can write metadata in a standard format. When I say standard, I mean standard internally. This is something that wouldn't really make sense in the outside world, but it's a metadata bus that's queryable by the rest of the company, and it encapsulates the entire delivery pipeline inside of Google. This includes things like code review metadata: you can see who reviewed every single change before it went in, what tests were run on it, and the results of those tests. And this applies all the way to production. With this metadata, we can apply policy and do some pretty interesting things. Here are just some examples of the policies we can apply. We have provenance information, so for any given binary running in production, we can see exactly who authored the changes inside of it, and we can see what changes are in each new version. We have metadata for the entire build process as well. Every once in a while, the Go compiler, the Java compiler, or the Python interpreter has a bug or a security issue in it, and the fix requires rebuilding all the binaries from source again.
So we can run queries against our production environments to figure out exactly which binaries need to be rebuilt and where they're running. Another cool feature is that we can apply policy at runtime as well. When developers build things locally on their machines and want to test them in a production-like environment, we can let them do that safely. Developers can run their code in a production environment because we have metadata about exactly how those binaries were built and signed, so we can write policies into the permission systems on our databases and say that only things that have been checked in, reviewed by a certain number of people, and built using our hermetic build system have access to our private data stores. Everything else only has access to sandboxed versions. So I hope I've made the case so far that delivery pipelines, once we take a broader view, are the right place to start thinking about securing software supply chains, especially when it comes to open source. They have all the metadata; we just have to extract it and get it into standard formats where we can start to make use of it. We have this kind of weird paradox where we apply all these standards and best practices, code review and unit testing, to the first-party code that we write, but when it comes to third-party code, we kind of turn a blind eye. We trust that somebody else is looking at it, and when everyone trusts that someone else is looking at it, we often run into problems like those supply chain attacks I talked about before. Google has done a whole bunch of work in this area because we have this unique setup, but there's nothing special about that system; anybody that spends enough time and energy can set it up themselves. The point, though, is that in the open source community, we shouldn't all have to do that ourselves.
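The two policies just described can be sketched in a few lines. The record shape below (field names like `reviewers` and `hermetic_build`) is entirely made up for illustration, not Google's actual schema; it just shows how build-provenance metadata supports both a runtime data-access policy and a "which binaries used the vulnerable toolchain" query.

```python
# Sketch of binary-provenance policy checks over an assumed record shape.
from dataclasses import dataclass, field

@dataclass
class Provenance:
    binary: str
    reviewers: list                                 # who approved the changes
    hermetic_build: bool                            # built by the trusted build system?
    toolchain: dict = field(default_factory=dict)   # e.g. {"go": "1.12.9"}

def may_touch_prod_data(p: Provenance, min_reviewers: int = 2) -> bool:
    """Only reviewed, hermetically built binaries get real user data;
    everything else is routed to sandboxed data stores."""
    return p.hermetic_build and len(p.reviewers) >= min_reviewers

def needs_rebuild(records, tool, bad_version):
    """Query which binaries were built with a vulnerable toolchain version."""
    return [p.binary for p in records if p.toolchain.get(tool) == bad_version]

prod = Provenance("frontend", ["alice", "bob"], True, {"go": "1.12.9"})
dev  = Provenance("experiment", ["carol"], False, {"go": "1.13.1"})

print(may_touch_prod_data(prod))                   # True
print(may_touch_prod_data(dev))                    # False
print(needs_rebuild([prod, dev], "go", "1.12.9"))  # ['frontend']
```

The point is that once the metadata exists in a queryable form, both the security response (rebuild queries) and the access policy become simple filters over it.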
If we start to talk about standardization in this space and come up with standard formats for the metadata and artifacts that we can query, we can start to build a shared metadata bus across the entire open source supply chain. If we do that together, we can save everybody from having to waste time doing it themselves. So let's get into some more details about standardization: what's missing, and what's going on here. This is a data problem. Aggregating all this metadata across all the different software projects in the open source community turns this into a big data problem. We need ways to extract the data from the software development going on in the community today and make it available and accessible to the rest of the community. Again, this metadata already exists; we've talked about that before. It's just in text files and build logs and git commit logs, and we need ways to get it out. That's where standardization comes into play. Once we start thinking about applying best practices to how we audit open source, we can start to figure out exactly what metadata is needed, what formats are best, and how to query and store it. Then we can define standards based on those practices. We can't do this out of thin air; we have to actually start trying to solve this problem in a few use cases first. From there, we can start to build tooling to make this easier. One of the great lessons of software development is that if something is not easy to do, if we don't have compilers automatically outputting metadata about which sources went into the binaries they build by default, then people aren't going to take the time to do it. And as we go forward, we can keep repeating the cycle, pushing the industry forward. Standardization isn't necessarily the most exciting piece of software development, but it is necessary, and it's what lets us build things on top of lower-level systems. So let's get specific here.
There are a couple of different pieces. Starting at the top left: pipeline definitions. Everyone has these bash scripts, configuration files, and ad hoc documents checked in describing how software is built and released. Since none of this is in a standard format, it's hard for us to look at our dependencies and figure out which best practices they're already following. We need standard notations for describing how software is built that apply to all projects and all our dependencies, so we can make sure they're following the same standards and best practices that we have. We need standards for toolchain declarations as well. I talked about bugs in the Go compiler being found that require security updates. There's also an old paper called Reflections on Trusting Trust that takes this to a scarier, deeper level: even if you know exactly which source files went into a compiler and which binary came out, if you don't also have all the source code for the compiler itself, then you can't really trust anything. And this applies to the compiler that built that compiler, so it starts to get kind of scary. Just because we know something was built using the go build command or the javac command, if we don't know exactly which version of the tool was used, we can't figure things out. Containers help a lot here because they let us encapsulate the entire file system used in a build so we can examine it later, but we need other standards to declare these things as well, to be able to apply policy on top of them. And down in the bottom left, source provenance: being able to track the individual authors, the actual people that worked on the code. Git commit logs contain email addresses, but if people aren't signing commits, or using DCO sign-offs or other systems, you really have no proof that something in git history actually came from that individual.
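The verify-on-read idea behind source provenance can be shown with a toy example. Real systems use GPG signatures or DCO sign-offs on commits; the HMAC with a shared key below is just a stand-in to keep the sketch self-contained, and the commit fields are invented.

```python
# Toy illustration: attach verifiable authorship metadata to a commit,
# then check it when reading history. HMAC stands in for real signing.
import hashlib
import hmac
import json

def sign_commit(commit: dict, author_key: bytes) -> str:
    """Produce a signature over a canonical encoding of the commit."""
    payload = json.dumps(commit, sort_keys=True).encode()
    return hmac.new(author_key, payload, hashlib.sha256).hexdigest()

def verify_commit(commit: dict, signature: str, author_key: bytes) -> bool:
    """True only if the commit is byte-for-byte what the author signed."""
    return hmac.compare_digest(sign_commit(commit, author_key), signature)

key = b"author-secret"
commit = {"author": "dev@example.com", "tree": "abc123", "msg": "fix bug"}
sig = sign_commit(commit, key)

print(verify_commit(commit, sig, key))       # True
tampered = dict(commit, author="mallory@example.com")
print(verify_commit(tampered, sig, key))     # False
```

With unsigned history, the tampered commit above would be indistinguishable from the real one, which is exactly the gap the talk is pointing at.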
We need standards for how to attach this metadata, and to require this metadata to be attached, on git commits or whatever source code management system you happen to be using, so that we can figure out exactly which other packages a developer has touched. Then we need to wrap all this up in metadata formats that can be exchanged as easily as we can pip install or npm install the packages themselves. This metadata needs to travel with the package, otherwise people aren't going to make use of it, and we need easy ways to pass this metadata back and forth between different organizations as artifacts change. So, putting it all together, this looks sort of similar to the Google diagram, except we need it to work for the rest of the open source community, with the tooling that we already have. As external software is built, we need ways to exchange provenance with the artifacts. When we develop our first-party software, we need to use those same mechanisms, and we need to define the pipelines for these in standard formats so that we can share our best practices and audit the way the rest of the software is built and released. All these development processes should be outputting metadata as well, and all of this should go into a standard, community-owned metadata bus so that we can apply production policies too. That's where the Linux Foundation and the Continuous Delivery Foundation come in. The Continuous Delivery Foundation was started earlier this year as a sub-foundation of the Linux Foundation, with Google as a founding member. I've probably forgotten to add some logos as members have joined over the year, but the growth has been very fast; Fujitsu recently joined as well, and we're glad to have them in the foundation. The work there is organized into projects, and the Tekton project that I work on is in the Continuous Delivery Foundation.
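The "metadata travels with the package" idea can be sketched as a small provenance document tied to the artifact bytes by a digest. The JSON shape here is made up for illustration; it is not any real standard's schema, just the shape of the exchange problem.

```python
# Sketch: provenance metadata that could ship alongside a package, bound
# to the exact artifact by a SHA-256 digest. Field names are invented.
import hashlib
import json

def make_provenance(name: str, version: str, source_repo: str,
                    artifact: bytes) -> str:
    doc = {
        "package": {"name": name, "version": version},
        "source": {"repo": source_repo},
        # The digest ties the metadata to these exact artifact bytes,
        # so a swapped artifact fails verification downstream.
        "artifact_sha256": hashlib.sha256(artifact).hexdigest(),
    }
    return json.dumps(doc, sort_keys=True)

def verify_provenance(doc_json: str, artifact: bytes) -> bool:
    doc = json.loads(doc_json)
    return doc["artifact_sha256"] == hashlib.sha256(artifact).hexdigest()

pkg = b"pretend wheel contents"
doc = make_provenance("example", "1.0.0",
                      "https://example.com/example.git", pkg)
print(verify_provenance(doc, pkg))                # True
print(verify_provenance(doc, b"tampered bytes"))  # False
```

This is the property the bootstrap-sass attack violated: the uploaded release didn't match what was in source control, and nothing bound the artifact to its claimed source.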
We also have SIGs, or Special Interest Groups, working on cross-cutting efforts, and the interesting one in this space is the Security Special Interest Group, which is focusing on supply chain security and working toward standardization of some of these formats. So, starting with the Tekton project: Tekton Pipelines is a project that lets people declaratively specify software delivery pipelines. It's kind of the first standard we talked about before, and it comes in two parts: there's a way to declare these pipelines, and Tekton also provides a trusted execution environment to run them in. Once you've declared everything, it's executed as sets of containers that run in a DAG, a graph. We can record metadata at every step along the way to make sure these things can't be tampered with. There's a whole bunch of other projects going on in this area too. Grafeas is an API designed to store metadata about artifacts, so you can attach things like vulnerability scan results, source provenance, and build results showing exactly how an artifact was produced, and query them over a standard API. in-toto is another project, in the CNCF, that takes a different angle on these concepts. in-toto describes itself as farm-to-table supply chain security. It defines a couple of different file types that declare exactly how software is supposed to be produced, and these can be exchanged between parties. Developers that have access to certain keys sign things as they execute commands, and then you can play those back to verify that things were built the way they're supposed to be built and weren't tampered with. And there's a lot of specification work going on. There are two efforts: the SBOM, or software bill of materials, effort is happening in the Object Management Group, and the SPDX project, the Software Package Data Exchange, is happening right here in the Linux Foundation.
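In Tekton, a pipeline is declared in YAML as tasks with dependencies between them; to show what that DAG declaration implies, here's a plain-Python sketch (not Tekton's actual API) that computes a valid execution order for a hypothetical clone/build/test/scan/deploy pipeline.

```python
# Sketch: a Tekton-style pipeline as a DAG of tasks. Each task maps to
# the set of tasks it must run after (like "runAfter" in Tekton YAML).
from graphlib import TopologicalSorter

pipeline = {
    "clone":  set(),
    "build":  {"clone"},
    "test":   {"build"},
    "scan":   {"build"},           # can run in parallel with "test"
    "deploy": {"test", "scan"},
}

# A topological sort gives one valid order respecting all dependencies.
order = list(TopologicalSorter(pipeline).static_order())
print(order)  # e.g. ['clone', 'build', 'test', 'scan', 'deploy']
```

Because each step is a container with declared inputs and outputs, the execution environment can record metadata at every node of this graph, which is what makes the tamper-evidence story possible.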
SPDX is one of the most commonly used SBOM formats for attaching metadata to software. It's been mostly focused on licensing data, but all of the concerns I talked about for security also apply to licensing; it's a very similar use case. If you're pulling in code and you don't know the licenses on it, then you don't know whether you're allowed to use it. The same thing applies to security. The new SPDX 3.0 effort is extending the specification to add support for security, provenance, and authorship data. So, how can people help, and how can we fix all of this together as an industry? The summary here is that we all need to start taking supply chain attacks and our open source security seriously. We can't trust that everyone else is looking out for these things; we can't just rely on the community here. We all have to actually take control of the software and the dependencies that we're using. If we do want to fix this as an industry, it's a standards, automation, and data problem that we can solve if we work together on it. And we can't really wait any longer; these attacks are happening more and more every day. If this is interesting to you, we're working on it right now in the Linux Foundation, so please reach out and get involved. Thank you for having me today. There are some helpful links here for how to get involved and start contributing to some of these projects in this space.
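The license side of this audit is the same pattern as the security side: a policy filter over declared metadata. Here's a minimal sketch checking dependencies' declared SPDX license identifiers against an allowlist; the package data and allowlist are illustrative, not a real policy.

```python
# Sketch: flag dependencies whose declared license (SPDX identifier) is
# missing or outside a hypothetical organizational allowlist.
ALLOWED = {"MIT", "Apache-2.0", "BSD-3-Clause"}

deps = {
    "express-like": "MIT",
    "left-pad-ish": "WTFPL",
    "mystery-pkg":  None,      # no declared license at all
}

def license_violations(deps: dict) -> list:
    """Return dependencies whose license is missing or not allowed."""
    return sorted(name for name, lic in deps.items() if lic not in ALLOWED)

print(license_violations(deps))  # ['left-pad-ish', 'mystery-pkg']
```

Once SBOM data ships with packages in a standard format, checks like this (and their security equivalents) can run automatically at install time instead of relying on someone reading every dependency.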